{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Central tendency" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import math\n", "import scipy.stats\n", "\n", "food = pd.read_pickle(\"../data/processed/food\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we begin any detailed analysis we usually want to explore and describe our data.\n", "Measures of central tendency are among the first things we use to do this.\n", "\n", "'Measure of central tendency' is a fancy phrase for 'average'.\n", "Each one is a single value used to represent a 'typical' value from your data.\n", "Depending on your level of measurement you can use one or more measures of central tendency.\n", "\n", "![There is more than one measure of central tendency](_images/average-meme.jpg)\n", "\n", "## Mode\n", "\n", "The mode is the most common value.\n", "It is the only measure of central tendency you can provide for nominal data.\n", "\n", "For example, the variable `A121r` in our food data set records household tenure type.\n", "The available options are:\n", "\n", "1. public rented (i.e. rented from a council)\n", "2. private rented (i.e. rented from a landlord)\n", "3. owned\n", "\n", "A frequency (count) table of this variable shows that `owned` is the most common type of tenure:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Convert to a categorical and label the categories in their sorted order\n", "food[\"A121r\"] = food[\"A121r\"].astype(\"category\")\n", "food[\"A121r\"] = food[\"A121r\"].cat.rename_categories([\"public rented\", \"private rented\", \"owned\"])\n", "food[\"A121r\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Median\n", "\n", "The median is the 'middle' point.\n", "It's only appropriate for ordered data (i.e. 
ordinal or numeric) and is calculated by arranging your data in order and selecting the midpoint.\n", "`P344pr` is the gross normal weekly household income for each respondent. The following are incomes for the first five respondents as an example:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "food[\"P344pr\"].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The variable looks like this when we plot it as a distribution:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "food.hist(\"P344pr\", bins=100)\n", "plt.xlabel(\"Weekly income (£)\")\n", "plt.ylabel(\"Frequency\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we arrange all of the incomes in order and take the middle point we obtain the median income:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "food[\"P344pr\"].median()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If your data have an even number of items, the median is the mean (average) of the two middle points.\n", "For example, with four data points (2, 4, 6, 8) there is no single middle point.\n", "Instead 4 and 6 are the middle points.\n", "The median is the mean of these, which is $\\frac{4 + 6}{2} = 5$.\n", "\n", "The median is often considered more *robust* than the mean, which means it is less susceptible to outliers, for reasons we'll get to in a moment." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Mean\n", "\n", "The mean is what most people think of when they think of an average.\n", "You simply \"add them all up and divide by how many you have\".\n", "For example, the mean of the incomes is:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "food[\"P344pr\"].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the incomes followed an ideal normal distribution, the mean and the median (and mode) would be identical (more on the [normal distribution](#Normal-distribution) later).\n", "In the wild, most distributions are not exactly normal (or ideal) so the mean and the median differ, as we have seen with our example data.\n", "\n", "Outliers in our data set can pull the mean up or down.\n", "For example, if a few individuals in our data are substantially wealthier than most, they pull the mean upwards.\n", "They also affect the median, but not as much as the mean.\n", "For this reason we often consider the median a more *robust* measure of central tendency than the mean, and it is why you should be careful when someone presents a mean value without any additional information." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }