PySpark avg() function

In this article, we will show how the avg() function works in PySpark. avg() is an aggregate function used to get the average value of one or more DataFrame columns. We can compute the average in three ways: with select(), with agg(), and with groupBy().

Let's go through them one by one.

First, let us create a DataFrame for the demonstration.

import pyspark
from pyspark.sql import SparkSession
# create a SparkSession with the app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()
# grocery data: 10 rows, each with 4 attributes
grocery_data =[{'food_id':112,'item':'onions','cost':234.89,'quantity':4},
               {'food_id':113,'item':'potato','cost':17.39,'quantity':1},
               {'food_id':102,'item':'grains','cost':4234.9,'quantity':84},
               {'food_id':98,'item':'shampoo/soap','cost':10.89,'quantity':2},
               {'food_id':98,'item':'shampoo/soap','cost':100.89,'quantity':20},
               {'food_id':98,'item':'shampoo/soap','cost':1234.89,'quantity':94},
               {'food_id':113,'item':'potato','cost':170.39,'quantity':10},
               {'food_id':113,'item':'potato','cost':34.39,'quantity':2},
               {'food_id':102,'item':'grains','cost':1000.9,'quantity':24},
               {'food_id':56,'item':'oil','cost':134.00,'quantity':10}]
# creating a dataframe from the grocery_data
input_dataframe = app.createDataFrame(grocery_data)
# display the DataFrame
input_dataframe.show()

Output:

+-------+-------+------------+--------+
|   cost|food_id|        item|quantity|
+-------+-------+------------+--------+
| 234.89|    112|      onions|       4|
|  17.39|    113|      potato|       1|
| 4234.9|    102|      grains|      84|
|  10.89|     98|shampoo/soap|       2|
| 100.89|     98|shampoo/soap|      20|
|1234.89|     98|shampoo/soap|      94|
| 170.39|    113|      potato|      10|
|  34.39|    113|      potato|       2|
| 1000.9|    102|      grains|      24|
|  134.0|     56|         oil|      10|
+-------+-------+------------+--------+
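
Note that the columns appear in alphabetical order (cost, food_id, item, quantity) rather than in the order the dictionary keys were written: when PySpark infers a schema from a list of dictionaries, the field names are sorted.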

avg(): Using the select() method

The select() method can be used to compute the average of one or more DataFrame columns at a time. It takes avg() expressions as parameters, so we first have to import the avg function from pyspark.sql.functions.

Syntax:

dataframe.select(avg('column1'), ..., avg('column_n'))

where,

  1. dataframe is the input PySpark DataFrame
  2. column1 ... column_n are the columns whose averages are returned

Example:

In this example, we will apply the avg() function to the cost and quantity columns.

import pyspark
from pyspark.sql import SparkSession
# import the avg function
from pyspark.sql.functions import avg
# create a SparkSession with the app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()
# grocery data: 10 rows, each with 4 attributes
grocery_data =[{'food_id':112,'item':'onions','cost':234.89,'quantity':4},
               {'food_id':113,'item':'potato','cost':17.39,'quantity':1},
               {'food_id':102,'item':'grains','cost':4234.9,'quantity':84},
               {'food_id':98,'item':'shampoo/soap','cost':10.89,'quantity':2},
               {'food_id':98,'item':'shampoo/soap','cost':100.89,'quantity':20},
               {'food_id':98,'item':'shampoo/soap','cost':1234.89,'quantity':94},
               {'food_id':113,'item':'potato','cost':170.39,'quantity':10},
               {'food_id':113,'item':'potato','cost':34.39,'quantity':2},
               {'food_id':102,'item':'grains','cost':1000.9,'quantity':24},
               {'food_id':56,'item':'oil','cost':134.00,'quantity':10}]
# creating a dataframe from the grocery_data
input_dataframe = app.createDataFrame(grocery_data)
# get the averages of the cost and quantity columns
input_dataframe.select(avg('cost'), avg('quantity')).show()

Output:

+-----------------+-------------+
|        avg(cost)|avg(quantity)|
+-----------------+-------------+
|717.3530000000001|         25.1|
+-----------------+-------------+
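
If you want readable column names, or the averages as plain Python numbers, you can alias each expression and read the result back with first(). A minimal sketch reusing the input_dataframe from above (the avg_cost and avg_quantity labels are our own choice):

from pyspark.sql.functions import avg
# alias the aggregate expressions to control the output column names
result = input_dataframe.select(avg('cost').alias('avg_cost'),
                                avg('quantity').alias('avg_quantity'))
# first() returns a Row, from which plain Python values can be read
row = result.first()
print(row['avg_cost'], row['avg_quantity'])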

avg(): Using the agg() method

agg() stands for aggregation and can also be used to compute the average of DataFrame columns. It takes a dictionary as a parameter, in which each key is a column name and the corresponding value is the name of the aggregate function, in our case 'avg'. We can specify multiple columns to apply the aggregate function to.

Syntax:

dataframe.agg({'column1': 'avg', ..., 'column_n': 'avg'})

where,

  1. dataframe is the input PySpark DataFrame
  2. each key is a column whose average is returned, and each value is the aggregate function name, here 'avg'

Example:

In this example, we will apply the avg() function to the cost and quantity columns.

import pyspark
from pyspark.sql import SparkSession
# create a SparkSession with the app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()
# grocery data: 10 rows, each with 4 attributes
grocery_data =[{'food_id':112,'item':'onions','cost':234.89,'quantity':4},
               {'food_id':113,'item':'potato','cost':17.39,'quantity':1},
               {'food_id':102,'item':'grains','cost':4234.9,'quantity':84},
               {'food_id':98,'item':'shampoo/soap','cost':10.89,'quantity':2},
               {'food_id':98,'item':'shampoo/soap','cost':100.89,'quantity':20},
               {'food_id':98,'item':'shampoo/soap','cost':1234.89,'quantity':94},
               {'food_id':113,'item':'potato','cost':170.39,'quantity':10},
               {'food_id':113,'item':'potato','cost':34.39,'quantity':2},
               {'food_id':102,'item':'grains','cost':1000.9,'quantity':24},
               {'food_id':56,'item':'oil','cost':134.00,'quantity':10}]
# creating a dataframe from the grocery_data
input_dataframe = app.createDataFrame(grocery_data)
# get the averages of the cost and quantity columns
input_dataframe.agg({'cost': 'avg', 'quantity': 'avg'}).show()

Output:

+-----------------+-------------+
|        avg(cost)|avg(quantity)|
+-----------------+-------------+
|717.3530000000001|         25.1|
+-----------------+-------------+
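
Since dictionary keys must be unique, the dictionary form of agg() can apply only one aggregate function per column, and the output names are fixed to avg(column). If that is limiting, agg() also accepts column expressions, which allow aliases; a small sketch with the same input_dataframe:

from pyspark.sql.functions import avg
# agg() with column expressions lets us rename the output columns
input_dataframe.agg(avg('cost').alias('avg_cost'),
                    avg('quantity').alias('avg_quantity')).show()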

avg(): Using groupBy() with avg()

If we want the average for each group of values, we have to use the groupBy() function. It groups the rows that share the same value in a column and returns the average for each group.

Syntax:

dataframe.groupBy('group_column').avg('column')

where,

  1. dataframe is the input PySpark DataFrame
  2. group_column is the column whose values define the groups
  3. column is the column whose average is returned for each group

Example:

Python program to get the average cost for each group in the item column.

import pyspark
from pyspark.sql import SparkSession
# create a SparkSession with the app name GKINDEX
app = SparkSession.builder.appName('GKINDEX').getOrCreate()
# grocery data: 10 rows, each with 4 attributes
grocery_data =[{'food_id':112,'item':'onions','cost':234.89,'quantity':4},
               {'food_id':113,'item':'potato','cost':17.39,'quantity':1},
               {'food_id':102,'item':'grains','cost':4234.9,'quantity':84},
               {'food_id':98,'item':'shampoo/soap','cost':10.89,'quantity':2},
               {'food_id':98,'item':'shampoo/soap','cost':100.89,'quantity':20},
               {'food_id':98,'item':'shampoo/soap','cost':1234.89,'quantity':94},
               {'food_id':113,'item':'potato','cost':170.39,'quantity':10},
               {'food_id':113,'item':'potato','cost':34.39,'quantity':2},
               {'food_id':102,'item':'grains','cost':1000.9,'quantity':24},
               {'food_id':56,'item':'oil','cost':134.00,'quantity':10}]
# creating a dataframe from the grocery_data
input_dataframe = app.createDataFrame(grocery_data)
# get the average of the cost column grouped by item
input_dataframe.groupBy('item').avg('cost').show()

Output:

+------------+------------------+
|        item|         avg(cost)|
+------------+------------------+
|      grains|2617.8999999999996|
|      onions|            234.89|
|      potato| 74.05666666666666|
|shampoo/soap|448.89000000000004|
|         oil|             134.0|
+------------+------------------+
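
avg() on a grouped DataFrame also accepts several columns at once, e.g. input_dataframe.groupBy('item').avg('cost', 'quantity'). Alternatively, groupBy() can be combined with agg() to alias the results; a minimal sketch with the same input_dataframe:

from pyspark.sql.functions import avg
# average both cost and quantity for each item group, with readable names
input_dataframe.groupBy('item').agg(avg('cost').alias('avg_cost'),
                                    avg('quantity').alias('avg_quantity')).show()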