{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Building Data\n",
"\n",
"This notebook discusses how to *build* a series or data frame, possibly using pieces of other series or data frames, to either make new columns for our data or set up a new data frame.\n",
"\n",
"We are going to use the [HETREC MovieLens data](https://grouplens.org/datasets/hetrec-2011/) again."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup and Data Load\n",
"\n",
"Load our Python modules:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the movie data:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"sns.countplot(x=scifi['era'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But the eras are in the wrong order! That's no good. We can fix that by making `era` an *ordered categorical*. We often leave categorical variables as strings (unless we have a very large number of data points), but strings give us no control over their order.\n",
"\n",
"Let's make it a categorical variable:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"era = era.astype('category')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And set the category order:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
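"# list the categories in display order; ordered=True also makes the categorical ordered\n",
"era = era.cat.reorder_categories(['Vintage', 'Star Wars', 'Post-Matrix'], ordered=True)\n",
"scifi['era'] = era\n",
"sns.countplot(x=scifi['era'])"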
]
},
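{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what the category order changes, here is a minimal sketch on a small made-up series. The name `toy_era` and its values are illustrative assumptions, not part of the MovieLens data. The sketch relies only on `astype('category')` and `cat.reorder_categories` as used above, plus the optional `ordered=True` flag, which additionally lets us sort and compare values by category order."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data standing in for the real `era` series\n",
"toy_era = pd.Series(['Star Wars', 'Vintage', 'Post-Matrix', 'Vintage'])\n",
"\n",
"# As a plain categorical, string categories start out in sorted (alphabetical) order\n",
"toy_cat = toy_era.astype('category')\n",
"print(list(toy_cat.cat.categories))\n",
"\n",
"# Reordering sets the display order; ordered=True also enables sorting and comparisons\n",
"toy_ord = toy_cat.cat.reorder_categories(\n",
"    ['Vintage', 'Star Wars', 'Post-Matrix'], ordered=True)\n",
"print(list(toy_ord.cat.categories))\n",
"print(toy_ord.sort_values().tolist())\n",
"print((toy_ord < 'Post-Matrix').tolist())"
]
},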
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating Data Frames\n",
"\n",
"In this section, we're going to see how to create data frames!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### From One Series\n",
"\n",
"We can create a data frame from *one* series by using the [`to_frame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.to_frame.html) method, as seen in [Reshaping](Reshaping.ipynb).\n",
"\n",
"Let's get a series:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0            1\n",
"1            2\n",
"2            3\n",
"3            4\n",
"4            5\n",
"         ...  \n",
"10192    65088\n",
"10193    65091\n",
"10194    65126\n",
"10195    65130\n",
"10196    65133\n",
"Name: id, Length: 10197, dtype: int64"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"id_s = movies['id']\n",
"id_s"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And convert it to a frame:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"       movieID\n",
"0            1\n",
"1            2\n",
"2            3\n",
"3            4\n",
"4            5\n",
"...        ...\n",
"10192    65088\n",
"10193    65091\n",
"10194    65126\n",
"10195    65130\n",
"10196    65133\n",
"\n",
"[10197 rows x 1 columns]"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mids = id_s.to_frame('movieID')\n",
"mids"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This **preserves the index**."
]
},
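{
"cell_type": "markdown",
"metadata": {},
"source": [
"Index preservation is easier to see when the index is not just 0, 1, 2, and so on. The following is a small sketch with made-up values (the names `toy_years` and `toy_frame` are hypothetical). It shows that `to_frame` carries the index labels over unchanged, and that `reset_index` can turn them into an ordinary column if that is what we want."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical series with a non-default index (movie IDs as labels)\n",
"toy_years = pd.Series([1995, 1977, 1999], index=[1, 260, 2571], name='year')\n",
"\n",
"# to_frame keeps the index; its argument names the new column\n",
"toy_frame = toy_years.to_frame('releaseYear')\n",
"print(toy_frame)\n",
"\n",
"# reset_index moves the index labels into a regular column\n",
"print(toy_frame.reset_index())"
]
},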
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### From Multiple Series\n",
"\n",
"If we have multiple series, with the same or compatible indexes, we can create a data frame from a *dictionary* mapping column names to series:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"       movieID                        title\n",
"0            1                    Toy story\n",
"1            2                      Jumanji\n",
"2            3               Grumpy Old Men\n",
"3            4            Waiting to Exhale\n",
"4            5  Father of the Bride Part II\n",
"...        ...                          ...\n",
"10192    65088              Bedtime Stories\n",
"10193    65091          Manhattan Melodrama\n",
"10194    65126                        Choke\n",
"10195    65130           Revolutionary Road\n",
"10196    65133      Blackadder Back & Forth\n",
"\n",
"[10197 rows x 2 columns]"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"titles = pd.DataFrame({\n",
" 'movieID': movies['id'],\n",
" 'title': movies['title']\n",
"})\n",
"titles"
]
},
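{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the example above both series come from the same frame, so their indexes match exactly. As a rough illustration of what happens when the indexes only partly overlap, here is a sketch with made-up series (`toy_titles` and `toy_counts` are hypothetical names). Pandas matches rows by index label rather than by position, and fills in missing values where a label is absent from one of the series."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Two hypothetical series whose indexes only partly overlap\n",
"toy_titles = pd.Series(['Alpha', 'Beta', 'Gamma'], index=[1, 2, 3])\n",
"toy_counts = pd.Series([10, 30], index=[1, 3])\n",
"\n",
"# Rows are matched by index label, not by position;\n",
"# label 2 is missing from toy_counts, so its count is NaN\n",
"toy_df = pd.DataFrame({'title': toy_titles, 'count': toy_counts})\n",
"print(toy_df)"
]
},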
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also use NumPy arrays or Python lists instead of Pandas series, so long as they are all the same length.\n",
"\n",
"We can also provide an index (with `index=`) in the data frame constructor."
]
},
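{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of both points, using made-up values (`toy_ids`, `toy_names`, and `toy_df2` are hypothetical names): arrays and lists carry no index of their own, so `index=` supplies the row labels, and inputs of different lengths are rejected with a `ValueError`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical data: one NumPy array and one Python list of the same length\n",
"toy_ids = np.array([1, 2, 3])\n",
"toy_names = ['Alpha', 'Beta', 'Gamma']\n",
"\n",
"# index= provides the row labels for the new frame\n",
"toy_df2 = pd.DataFrame(\n",
"    {'movieID': toy_ids, 'title': toy_names},\n",
"    index=['a', 'b', 'c'])\n",
"print(toy_df2)\n",
"\n",
"# Inputs of different lengths cannot be combined this way\n",
"try:\n",
"    pd.DataFrame({'movieID': toy_ids, 'title': toy_names[:2]})\n",
"except ValueError as err:\n",
"    print('ValueError:', err)"
]
},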
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### From a List of Rows\n",
"\n",
"One more way to create a data frame is from a list (or any iterable) of rows, where each row is either a tuple or a dictionary.\n",
"The [`from_records`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.from_records.html) function does this.\n",
"\n",
"If the rows are dictionaries, their keys are used as column names; if they are tuples, you can specify the column names with `columns=['name1', 'name2']`.\n",
"\n",
"A very common source of data like this is a list or file of JSON objects, or some other source of dictionaries (such as the census data, or a MongoDB connection).\n",
"\n",
"For example, the [Rent the Runway data](https://cseweb.ucsd.edu/~jmcauley/datasets.html#clothing_fit) comes as a GZIP-compressed list of JSON objects, one per line. We can read this into a list of JSON objects like this:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'fit': 'fit',\n",
" 'user_id': '420272',\n",
" 'bust size': '34d',\n",
" 'item_id': '2260466',\n",
" 'weight': '137lbs',\n",
" 'rating': '10',\n",
" 'rented for': 'vacation',\n",
" 'review_text': \"An adorable romper! Belt and zipper were a little hard to navigate in a full day of wear/bathroom use, but that's to be expected. Wish it had pockets, but other than that-- absolutely perfect! I got a million compliments.\",\n",
" 'body type': 'hourglass',\n",
" 'review_summary': 'So many compliments!',\n",
" 'category': 'romper',\n",
" 'height': '5\\' 8\"',\n",
" 'size': 14,\n",
" 'age': '28',\n",
" 'review_date': 'April 20, 2016'}"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import json, gzip\n",
"with gzip.open('renttherunway_final_data.json.gz', 'r') as zf:\n",
" rtr_records = [json.loads(line) for line in zf]\n",
"\n",
"# dump first value of list\n",
"rtr_records[0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can then use `from_records` to make a data frame:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 192544 entries, 0 to 192543\n",
"Data columns (total 15 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 fit 192544 non-null object\n",
" 1 user_id 192544 non-null object\n",
" 2 bust size 174133 non-null object\n",
" 3 item_id 192544 non-null object\n",
" 4 weight 162562 non-null object\n",
" 5 rating 192462 non-null object\n",
" 6 rented for 192534 non-null object\n",
" 7 review_text 192544 non-null object\n",
" 8 body type 177907 non-null object\n",
" 9 review_summary 192544 non-null object\n",
" 10 category 192544 non-null object\n",
" 11 height 191867 non-null object\n",
" 12 size 192544 non-null int64 \n",
" 13 age 191584 non-null object\n",
" 14 review_date 192544 non-null object\n",
"dtypes: int64(1), object(14)\n",
"memory usage: 22.0+ MB\n"
]
}
],
"source": [
"rtr = pd.DataFrame.from_records(rtr_records)\n",
"rtr.info()"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"   fit  user_id  bust size  item_id  weight  rating     rented for  \\\n",
"0  fit   420272        34d  2260466  137lbs      10       vacation  \n",
"1  fit   273551        34b   153475  132lbs      10          other  \n",
"2  fit   360448        NaN  1063761     NaN      10          party  \n",
"3  fit   909926        34c   126335  135lbs       8  formal affair  \n",
"4  fit   151944        34b   616682  145lbs      10        wedding  \n",
"\n",
"                                         review_text          body type  \\\n",
"0  An adorable romper! Belt and zipper were a lit...          hourglass  \n",
"1  I rented this dress for a photo shoot. The the...  straight & narrow  \n",
"2  This hugged in all the right places! It was a ...                NaN  \n",
"3  I rented this for my company's black tie award...               pear  \n",
"4  I have always been petite in my upper body and...           athletic  \n",
"\n",
"                                      review_summary  category  height  size  \\\n",
"0                               So many compliments!    romper   5' 8\"    14  \n",
"1                            I felt so glamourous!!!      gown   5' 6\"    12  \n",
"2  It was a great time to celebrate the (almost) ...    sheath   5' 4\"     4  \n",
"3    Dress arrived on time and in perfect condition.     dress   5' 5\"     8  \n",
"4                    Was in love with this dress !!!      gown   5' 9\"    12  \n",
"\n",
"   age         review_date  \n",
"0   28      April 20, 2016  \n",
"1   36       June 18, 2013  \n",
"2  116   December 14, 2015  \n",
"3   34   February 12, 2014  \n",
"4   27  September 26, 2016  "
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"rtr.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We would still need to convert a lot of data types (`info()` reports every column except `size` as `object`), but we have the data!\n",
"\n",
"> **Note:** Pandas can also read JSON lines directly from a file with `pd.read_json(..., lines=True)`, but `from_records` has many other uses."
]
},
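{
"cell_type": "markdown",
"metadata": {},
"source": [
"Two variations mentioned in this section can be sketched on a couple of made-up rows (the names `toy_rows`, `from_tuples`, `toy_jsonl`, and `from_jsonl` are hypothetical, and `StringIO` stands in for a real file): `from_records` with tuples and `columns=`, and `pd.read_json` with `lines=True` for line-delimited JSON. This is only an illustration under those assumptions, not the way the Rent the Runway data has to be loaded."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import io\n",
"import pandas as pd\n",
"\n",
"# Hypothetical rows as tuples; tuples carry no names, so we pass columns=\n",
"toy_rows = [('420272', 'romper', 14), ('273551', 'gown', 12)]\n",
"from_tuples = pd.DataFrame.from_records(\n",
"    toy_rows, columns=['user_id', 'category', 'size'])\n",
"print(from_tuples)\n",
"\n",
"# The same rows as JSON lines, one object per line\n",
"toy_jsonl = io.StringIO(\n",
"    '{\"user_id\": \"420272\", \"category\": \"romper\", \"size\": 14}\\n'\n",
"    '{\"user_id\": \"273551\", \"category\": \"gown\", \"size\": 12}\\n')\n",
"\n",
"# read_json may infer numeric dtypes, so check dtypes after loading\n",
"from_jsonl = pd.read_json(toy_jsonl, lines=True)\n",
"print(from_jsonl)\n",
"print(from_jsonl.dtypes)"
]
},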
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 4
}