PySpark Introduction
In this article, we will discuss PySpark and how to create a DataFrame in PySpark.
Introduction to Big Data
Big Data is one of the trending technologies in today's world. Without data, we cannot survive. Enormous amounts of data are generated every day, and this data must be processed. Many technologies have come along to process data, but processing it efficiently, while using as few resources as possible, is what matters. One of the best options that Big Data provides is Spark.
Spark provides processing frameworks and APIs in different programming languages such as Java, Python, and Scala. Spark is used to store and process data in an efficient way. Now we will discuss Spark technology in Python.
Python supports Spark through a module known as PySpark.
Let us discuss PySpark. If we want to use PySpark, we first have to install it, which we can do by using the pip command.
Syntax:
pip install pyspark
Now, PySpark is ready to use. Let's see how to import PySpark and use it.
Step 1: Import the PySpark module. We can do this by using the import statement. Syntax:
import pyspark
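After importing, we can optionally check which version of PySpark is installed. A minimal sketch of such a check (assuming a standard PySpark installation, where the module exposes a __version__ attribute):
# print the installed PySpark version
print(pyspark.__version__)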
Step 2: Create a Spark app
We must create the Spark app with a name through a SparkSession. So, we must build the SparkSession and create the app, which we can do by using the getOrCreate() method.
Syntax:
from pyspark.sql import SparkSession
app = SparkSession.builder.appName('app_name').getOrCreate()
We are now ready to use PySpark! Let's create the DataFrame.
Before creating the DataFrame, we should know what a DataFrame is. A DataFrame is a data structure that stores data in rows and columns. We can create a DataFrame from a list of dictionaries, in which each dictionary represents one row: its keys refer to the column names in the DataFrame and its values refer to the values in that row.
In PySpark, we can create the DataFrame by using the createDataFrame() method.
Syntax:
app.createDataFrame(data)
where data is the list of dictionaries.
If we want to display the DataFrame, we can use the show() method, which displays the DataFrame in tabular format.
Example: Python program to create a DataFrame from grocery data
import pyspark
from pyspark.sql import SparkSession
# create the Spark app with the name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()
# create grocery data: 5 items, each with 4 attributes
grocery_data =[{'food_id':112,'item':'onions','cost':234.89,'quantity':4},
{'food_id':113,'item':'potato','cost':17.39,'quantity':1},
{'food_id':102,'item':'grains','cost':4234.9,'quantity':84},
{'food_id':98,'item':'shampoo/soap','cost':1234.89,'quantity':94},
{'food_id':56,'item':'oil','cost':134.00,'quantity':10}]
# creating a dataframe from the grocery_data
input_dataframe = app.createDataFrame(grocery_data)
# display the dataframe
input_dataframe.show()
Output:
+-------+-------+------------+--------+
| cost|food_id| item|quantity|
+-------+-------+------------+--------+
| 234.89| 112| onions| 4|
| 17.39| 113| potato| 1|
| 4234.9| 102| grains| 84|
|1234.89| 98|shampoo/soap| 94|
| 134.0| 56| oil| 10|
+-------+-------+------------+--------+
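By default, show() prints up to 20 rows. It also accepts an optional row count if we want to limit the output; a minimal sketch, reusing input_dataframe from the example above:
# display only the first 2 rows of the dataframe in tabular format
input_dataframe.show(2)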
If we want to display the DataFrame in row format, we can use the collect() method.
Syntax:
dataframe.collect()
Example: Display PySpark DataFrame in Row format
import pyspark
from pyspark.sql import SparkSession
# create the Spark app with the name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()
# create grocery data: 5 items, each with 4 attributes
grocery_data =[{'food_id':112,'item':'onions','cost':234.89,'quantity':4},
{'food_id':113,'item':'potato','cost':17.39,'quantity':1},
{'food_id':102,'item':'grains','cost':4234.9,'quantity':84},
{'food_id':98,'item':'shampoo/soap','cost':1234.89,'quantity':94},
{'food_id':56,'item':'oil','cost':134.00,'quantity':10}]
# creating a dataframe from the grocery_data
input_dataframe = app.createDataFrame(grocery_data)
# display the dataframe as a list of Row objects
input_dataframe.collect()
Output:
[Row(cost=234.89, food_id=112, item='onions', quantity=4),
Row(cost=17.39, food_id=113, item='potato', quantity=1),
Row(cost=4234.9, food_id=102, item='grains', quantity=84),
Row(cost=1234.89, food_id=98, item='shampoo/soap', quantity=94),
Row(cost=134.0, food_id=56, item='oil', quantity=10)]
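collect() returns a Python list of Row objects, so individual fields can be read by attribute or by key; a minimal sketch, reusing input_dataframe from the example above:
rows = input_dataframe.collect()
# access fields of the first Row by attribute and by key
print(rows[0].item)     # onions
print(rows[0]['cost'])  # 234.89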
If we want to get the top rows, we can use the head() method. We have to specify the number of rows to be returned as a parameter.
Syntax:
dataframe.head(n)
where n is the number of rows.
Similarly, if we want to get the last rows, we can use the tail() method. We have to specify the number of rows to be returned as a parameter.
Syntax:
dataframe.tail(n)
where n is the number of rows.
Example: Display top and last rows
import pyspark
from pyspark.sql import SparkSession
# create the Spark app with the name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()
# create grocery data: 5 items, each with 4 attributes
grocery_data =[{'food_id':112,'item':'onions','cost':234.89,'quantity':4},
{'food_id':113,'item':'potato','cost':17.39,'quantity':1},
{'food_id':102,'item':'grains','cost':4234.9,'quantity':84},
{'food_id':98,'item':'shampoo/soap','cost':1234.89,'quantity':94},
{'food_id':56,'item':'oil','cost':134.00,'quantity':10}]
# creating a dataframe from the grocery_data
input_dataframe = app.createDataFrame(grocery_data)
# display the first 4 rows
print(input_dataframe.head(4))
# display the last 2 rows
print(input_dataframe.tail(2))
Output:
[Row(cost=234.89, food_id=112, item='onions', quantity=4), Row(cost=17.39, food_id=113, item='potato', quantity=1), Row(cost=4234.9, food_id=102, item='grains', quantity=84), Row(cost=1234.89, food_id=98, item='shampoo/soap', quantity=94)]
[Row(cost=1234.89, food_id=98, item='shampoo/soap', quantity=94), Row(cost=134.0, food_id=56, item='oil', quantity=10)]
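Note that head() called without an argument behaves slightly differently: it returns only the first Row instead of a list. A minimal sketch, reusing input_dataframe from the example above:
# head() with no argument returns a single Row object (not a list)
first_row = input_dataframe.head()
print(first_row)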