PySpark - max() function
In this post, we discuss the max() function in PySpark. max() is an aggregate function that returns the maximum value of one or more DataFrame columns. We can get the maximum value in three ways; let us look at them one by one.
Let us create the dataframe for demonstration.
import pyspark
from pyspark.sql import SparkSession
# create the app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()
# create grocery data with 5 items with 4 attributes
grocery_data =[{'food_id':112,'item':'onions','cost':234.89,'quantity':4},
{'food_id':113,'item':'potato','cost':17.39,'quantity':1},
{'food_id':102,'item':'grains','cost':4234.9,'quantity':84},
{'food_id':98,'item':'shampoo/soap','cost':10.89,'quantity':2},
{'food_id':98,'item':'shampoo/soap','cost':100.89,'quantity':20},
{'food_id':98,'item':'shampoo/soap','cost':1234.89,'quantity':94},
{'food_id':113,'item':'potato','cost':170.39,'quantity':10},
{'food_id':113,'item':'potato','cost':34.39,'quantity':2},
{'food_id':102,'item':'grains','cost':1000.9,'quantity':24},
{'food_id':56,'item':'oil','cost':134.00,'quantity':10}]
# creating a dataframe from the grocery_data
input_dataframe = app.createDataFrame( grocery_data)
#display
input_dataframe.show()
Output:
+-------+-------+------------+--------+
|   cost|food_id|        item|quantity|
+-------+-------+------------+--------+
| 234.89|    112|      onions|       4|
|  17.39|    113|      potato|       1|
| 4234.9|    102|      grains|      84|
|  10.89|     98|shampoo/soap|       2|
| 100.89|     98|shampoo/soap|      20|
|1234.89|     98|shampoo/soap|      94|
| 170.39|    113|      potato|      10|
|  34.39|    113|      potato|       2|
| 1000.9|    102|      grains|      24|
|  134.0|     56|         oil|      10|
+-------+-------+------------+--------+
Method - 1 : Using select() method
The select() method returns the maximum value of DataFrame columns when max() column expressions are passed to it. It accepts one or more such expressions at a time. The max function must first be imported from pyspark.sql.functions; note that this import shadows Python's built-in max() in the current module. Syntax:
dataframe.select(max('column1'),............,max('column n'))
where,
- dataframe is the input PySpark DataFrame
- column1 ... column n are the names of the columns whose maximum value is returned
Example:
In this example, we use the max function on the cost and quantity columns.
import pyspark
from pyspark.sql import SparkSession
#import max function
from pyspark.sql.functions import max
# create the app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()
# create grocery data with 5 items with 4 attributes
grocery_data =[{'food_id':112,'item':'onions','cost':234.89,'quantity':4},
{'food_id':113,'item':'potato','cost':17.39,'quantity':1},
{'food_id':102,'item':'grains','cost':4234.9,'quantity':84},
{'food_id':98,'item':'shampoo/soap','cost':10.89,'quantity':2},
{'food_id':98,'item':'shampoo/soap','cost':100.89,'quantity':20},
{'food_id':98,'item':'shampoo/soap','cost':1234.89,'quantity':94},
{'food_id':113,'item':'potato','cost':170.39,'quantity':10},
{'food_id':113,'item':'potato','cost':34.39,'quantity':2},
{'food_id':102,'item':'grains','cost':1000.9,'quantity':24},
{'food_id':56,'item':'oil','cost':134.00,'quantity':10}]
# creating a dataframe from the grocery_data
input_dataframe = app.createDataFrame( grocery_data)
#get the maximum of cost and quantity column
input_dataframe.select(max('cost'),max('quantity')).show()
Output:
+---------+-------------+
|max(cost)|max(quantity)|
+---------+-------------+
|   4234.9|           94|
+---------+-------------+
Method - 2 : Using agg() method
agg() stands for aggregation. It can be used to get the maximum value of DataFrame columns. It takes a dictionary as a parameter, in which each key is a column name in the dataframe and each value is the name of the aggregate function, here 'max'. We can specify multiple columns to apply the aggregate function to. Syntax:
dataframe.agg({'column1': 'max',......,'column n':'max'})
where,
- dataframe is the input PySpark DataFrame
- column1 ... column n are the names of the columns whose maximum value is returned
Example:
In this example, we use agg() to get the maximum of the cost and quantity columns.
import pyspark
from pyspark.sql import SparkSession
# create the app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()
# create grocery data with 5 items with 4 attributes
grocery_data =[{'food_id':112,'item':'onions','cost':234.89,'quantity':4},
{'food_id':113,'item':'potato','cost':17.39,'quantity':1},
{'food_id':102,'item':'grains','cost':4234.9,'quantity':84},
{'food_id':98,'item':'shampoo/soap','cost':10.89,'quantity':2},
{'food_id':98,'item':'shampoo/soap','cost':100.89,'quantity':20},
{'food_id':98,'item':'shampoo/soap','cost':1234.89,'quantity':94},
{'food_id':113,'item':'potato','cost':170.39,'quantity':10},
{'food_id':113,'item':'potato','cost':34.39,'quantity':2},
{'food_id':102,'item':'grains','cost':1000.9,'quantity':24},
{'food_id':56,'item':'oil','cost':134.00,'quantity':10}]
# creating a dataframe from the grocery_data
input_dataframe = app.createDataFrame( grocery_data)
#get the maximum of cost and quantity column
input_dataframe.agg({'cost': 'max','quantity':'max'}).show()
Output:
+---------+-------------+
|max(cost)|max(quantity)|
+---------+-------------+
|   4234.9|           94|
+---------+-------------+
Method - 3 : Using groupBy() with max()
To get the maximum value within each group, we use the groupBy() function. It groups the rows that share the same value in a column and returns the maximum value for each group. Syntax:
dataframe.groupBy('group_column').max('column')
where,
- dataframe is the input PySpark DataFrame
- group_column is the column whose values define the groups
- column is the name of the column whose maximum value is returned for each group
Example:
Python program to get the maximum cost for each value of the item column.
import pyspark
from pyspark.sql import SparkSession
# create the app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()
# create grocery data with 5 items with 4 attributes
grocery_data =[{'food_id':112,'item':'onions','cost':234.89,'quantity':4},
{'food_id':113,'item':'potato','cost':17.39,'quantity':1},
{'food_id':102,'item':'grains','cost':4234.9,'quantity':84},
{'food_id':98,'item':'shampoo/soap','cost':10.89,'quantity':2},
{'food_id':98,'item':'shampoo/soap','cost':100.89,'quantity':20},
{'food_id':98,'item':'shampoo/soap','cost':1234.89,'quantity':94},
{'food_id':113,'item':'potato','cost':170.39,'quantity':10},
{'food_id':113,'item':'potato','cost':34.39,'quantity':2},
{'food_id':102,'item':'grains','cost':1000.9,'quantity':24},
{'food_id':56,'item':'oil','cost':134.00,'quantity':10}]
# creating a dataframe from the grocery_data
input_dataframe = app.createDataFrame( grocery_data)
#get the maximum of cost column grouped by item
input_dataframe.groupBy('item').max('cost').show()
Output:
+------------+---------+
|        item|max(cost)|
+------------+---------+
|      grains|   4234.9|
|      onions|   234.89|
|      potato|   170.39|
|shampoo/soap|  1234.89|
|         oil|    134.0|
+------------+---------+