PySpark Introduction
In this article, we will discuss PySpark and how to create a DataFrame in PySpark.
Introduction to Big Data
Big Data is one of the trending technologies in today's world. Without data, we cannot survive. Enormous amounts of data are generated every day, and this data must be processed. Many technologies have come along to process data, but processing it efficiently, while using as few resources as possible, is what matters. One of the best options that Big Data provides is Spark.
Spark provides processing frameworks and APIs in different programming languages such as Java, Python, and Scala. Spark is used to store and process data in an efficient way. Now we will discuss Spark technology in Python.
Python supports Spark through a module known as PySpark.
Let us discuss PySpark. If we want to use PySpark, we first have to install it, which we can do by using the pip command.
Syntax:
pip install pyspark
Now, PySpark is ready to use. Let's see how to import PySpark and use it.
Step 1: Import the PySpark module. We can do this by using the import statement. Syntax:
import pyspark
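After importing, we can optionally check which version of PySpark is installed. A minimal sketch of such a check (assuming a standard PySpark installation, where the module exposes a __version__ attribute):
# print the installed PySpark version
print(pyspark.__version__)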
Step 2: Create a Spark app
We must create the Spark app with a name through a SparkSession. So, we must build the SparkSession and create the app, which we can do by using the getOrCreate() method.
Syntax:
from pyspark.sql import SparkSession
app = SparkSession.builder.appName('app_name').getOrCreate()
We are now ready to use PySpark! Let's create the DataFrame.
Before creating the DataFrame, we should know what a DataFrame is. A DataFrame is a data structure that stores data in rows and columns. We can create a DataFrame from a list of dictionaries, in which each dictionary represents one row: its keys refer to the column names in the DataFrame and its values refer to the values in that row.
In PySpark, we can create the DataFrame by using the createDataFrame() method.
Syntax:
app.createDataFrame(data)
where data is the list of dictionaries.
If we want to display the DataFrame, we can use the show() method, which displays the DataFrame in tabular format.
Example: Python program to create a DataFrame from grocery data
import pyspark
from pyspark.sql import SparkSession
# create the Spark app with the name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()
# create grocery data: 5 items, each with 4 attributes
grocery_data =[{'food_id':112,'item':'onions','cost':234.89,'quantity':4},
{'food_id':113,'item':'potato','cost':17.39,'quantity':1},
{'food_id':102,'item':'grains','cost':4234.9,'quantity':84},
{'food_id':98,'item':'shampoo/soap','cost':1234.89,'quantity':94},
{'food_id':56,'item':'oil','cost':134.00,'quantity':10}]
# creating a dataframe from the grocery_data
input_dataframe = app.createDataFrame(grocery_data)
# display the dataframe
input_dataframe.show()
Output:
+-------+-------+------------+--------+
| cost|food_id| item|quantity|
+-------+-------+------------+--------+
| 234.89| 112| onions| 4|
| 17.39| 113| potato| 1|
| 4234.9| 102| grains| 84|
|1234.89| 98|shampoo/soap| 94|
| 134.0| 56| oil| 10|
+-------+-------+------------+--------+
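By default, show() prints up to 20 rows. It also accepts an optional row count if we want to limit the output; a minimal sketch, reusing input_dataframe from the example above:
# display only the first 2 rows of the dataframe in tabular format
input_dataframe.show(2)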
If we want to display the DataFrame in row format, we can use the collect() method.
Syntax:
dataframe.collect()
Example: Display PySpark DataFrame in Row format
import pyspark
from pyspark.sql import SparkSession
# create the Spark app with the name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()
# create grocery data: 5 items, each with 4 attributes
grocery_data =[{'food_id':112,'item':'onions','cost':234.89,'quantity':4},
{'food_id':113,'item':'potato','cost':17.39,'quantity':1},
{'food_id':102,'item':'grains','cost':4234.9,'quantity':84},
{'food_id':98,'item':'shampoo/soap','cost':1234.89,'quantity':94},
{'food_id':56,'item':'oil','cost':134.00,'quantity':10}]
# creating a dataframe from the grocery_data
input_dataframe = app.createDataFrame(grocery_data)
# display the dataframe as a list of Row objects
input_dataframe.collect()
Output:
[Row(cost=234.89, food_id=112, item='onions', quantity=4),
Row(cost=17.39, food_id=113, item='potato', quantity=1),
Row(cost=4234.9, food_id=102, item='grains', quantity=84),
Row(cost=1234.89, food_id=98, item='shampoo/soap', quantity=94),
Row(cost=134.0, food_id=56, item='oil', quantity=10)]
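collect() returns a Python list of Row objects, so individual fields can be read by attribute or by key; a minimal sketch, reusing input_dataframe from the example above:
rows = input_dataframe.collect()
# access fields of the first Row by attribute and by key
print(rows[0].item)     # onions
print(rows[0]['cost'])  # 234.89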
If we want to get the top rows, we can use the head() method. We have to specify the number of rows to be returned as a parameter.
Syntax:
dataframe.head(n)
where n is the number of rows.
Similarly, if we want to get the last rows, we can use the tail() method. We have to specify the number of rows to be returned as a parameter.
Syntax:
dataframe.tail(n)
where n is the number of rows.
Example: Display top and last rows
import pyspark
from pyspark.sql import SparkSession
# create the Spark app with the name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()
# create grocery data: 5 items, each with 4 attributes
grocery_data =[{'food_id':112,'item':'onions','cost':234.89,'quantity':4},
{'food_id':113,'item':'potato','cost':17.39,'quantity':1},
{'food_id':102,'item':'grains','cost':4234.9,'quantity':84},
{'food_id':98,'item':'shampoo/soap','cost':1234.89,'quantity':94},
{'food_id':56,'item':'oil','cost':134.00,'quantity':10}]
# creating a dataframe from the grocery_data
input_dataframe = app.createDataFrame(grocery_data)
# display the first 4 rows
print(input_dataframe.head(4))
# display the last 2 rows
print(input_dataframe.tail(2))
Output:
[Row(cost=234.89, food_id=112, item='onions', quantity=4), Row(cost=17.39, food_id=113, item='potato', quantity=1), Row(cost=4234.9, food_id=102, item='grains', quantity=84), Row(cost=1234.89, food_id=98, item='shampoo/soap', quantity=94)]
[Row(cost=1234.89, food_id=98, item='shampoo/soap', quantity=94), Row(cost=134.0, food_id=56, item='oil', quantity=10)]
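Note that head() called without an argument behaves slightly differently: it returns only the first Row instead of a list. A minimal sketch, reusing input_dataframe from the example above:
# head() with no argument returns a single Row object (not a list)
first_row = input_dataframe.head()
print(first_row)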