CIS-3010 Lab Assignment #8: FoodTracker Project

Due: Friday, 2023-04-28

Reading: [TBD]


Table of Contents

  1. Introduction
  2. Requirements
  3. Implementation Notes
  4. Deliverables

In this lab you will design and implement a database that allows users to track the food they eat. Recording this information is useful for users who want to manage their weight or other health issues such as blood pressure or calcium intake. This document outlines the requirements of the system and discusses some implementation issues. Certain design details are left up to you.

Complex systems are often implemented in three tiers. Clients typically use a web browser (or mobile app) to access the system. JavaScript executing in the browser itself can provide some of the system's functionality. The middle tier is the web server or application server which executes a web application in PHP, Java, Python, or some other language. That program commonly accesses a database. The database may also include some stored procedures that provide some of the system's functionality. The web application executes the "business logic" of the overall application while the database manages the data storage.

One of the central design decisions that must be made when designing an overall system is how the system's functionality will be divided across the three tiers. Often there are many choices for how to do this, with each choice having its own set of advantages and disadvantages. Since this lab is part of a database course, we will make choices that emphasize the database. This is a good approach for a situation where the database has many clients besides just a web application (e.g., a desktop application or a mobile application that communicates directly with the database). Access control and other control logic implemented in the database is automatically available to all clients without having to replicate that logic in each client.


1. Introduction

We will make use of the USDA food nutrient data. This data is extensive and comes as several CSV files. The zip archive also contains a PDF document that describes the data and how it is organized. We will not need to use all of this data! Also, be sure that the organization of this data does not bias your thinking when you design the database for this assignment. You want to adjust the data to fit your design, not adjust your design to fit the data.


2. Requirements

In this section the requirements of the system as a whole are enumerated.

Users

There are three user roles as described below.

  1. Administrator (food_admin). Administrators are able to manage the list of system users and also update the food database. They may also need to perform maintenance actions on other aspects of the database such as correcting/removing bad recipes (to name one example). In effect, administrators need to have full access to every aspect of the database.

  2. Ordinary (food_user). Ordinary users track their food. They need to be able to record what they eat as well as enter named recipes containing ingredient lists (with amounts) and a number of servings.

  3. Researcher (food_researcher). Researchers want to access aggregated information that is stripped of identifying data (to protect the privacy of ordinary users) to research trends, correlations between foods and medical outcomes, and similar things.

For this lab, assume that the web application connects to the database as yourself, in your role as the system developer and owner of all database objects. Thus, the web application has full access to everything. Maintain an ordinary table in the database that lists all the users of your system and their roles as described above. When one of those users logs into the web application, the web application issues a SET ROLE command before interacting with the database on behalf of that user. Database privileges should be adjusted as appropriate to ensure the security requirements are enforced to the greatest degree feasible.
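
For example, the role setup and the per-request role switch might look roughly like the sketch below. The account name jstudent is just a stand-in for your own login; the roles are the three described above.

     -- Create the three application roles (no direct login for any of them).
     CREATE ROLE food_admin      NOLOGIN;
     CREATE ROLE food_user       NOLOGIN;
     CREATE ROLE food_researcher NOLOGIN;

     -- The developer account used by the web application must be a member of
     -- a role before it can switch to it.
     GRANT food_admin, food_user, food_researcher TO jstudent;

     -- Issued by the web application after a user authenticates...
     SET ROLE food_user;
     -- ... interact with the database on that user's behalf, then:
     RESET ROLE;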

Some security requirements might need help from stored procedures or functions that have the EXECUTE privilege granted to the appropriate roles but that are created as SECURITY DEFINER procedures. Such procedures run with the authority of their owner and thus, in this situation, are able to access any database object in any way necessary.
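
As a sketch of the pattern only (the function name, its body, and the system_user table it reads are placeholders, not part of the assignment):

     -- Runs with the privileges of its owner (you), so it can read a table
     -- that food_researcher cannot access directly.
     CREATE FUNCTION how_many_users() RETURNS bigint AS $$
         SELECT count(*) FROM system_user
     $$ LANGUAGE SQL SECURITY DEFINER;

     -- EXECUTE is granted to PUBLIC by default, so take it away first.
     REVOKE EXECUTE ON FUNCTION how_many_users() FROM PUBLIC;
     GRANT  EXECUTE ON FUNCTION how_many_users() TO food_researcher;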

Some other security requirements might need help from the web application. For example, users should only be able to see their own recipes. However, if a hypothetical recipe table is SELECTable by food_user, then, in theory, every user could see everyone else's recipes. The web application should still set its role to food_user before selecting from that table, but it should then SELECT only the records belonging to the user who is currently logged in.
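
Concretely, the statement the web application issues on behalf of a logged-in user might look something like this (the recipe table's layout and the literal 42 are placeholders):

     SET ROLE food_user;
     SELECT recipe_id, recipe_name, servings
     FROM   recipe
     WHERE  owner_id = 42;    -- the ID of the currently logged-in user
     RESET ROLE;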

This approach requires that the database track a possibly large list of users along with their roles and hashed passwords. The web application will use that information to allow users to log in. Note that each user has a single role. If an individual uses the system as both an administrator and an ordinary user, for example, that person would have two distinct user accounts.
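
A minimal sketch of such a table follows; the table name, attribute names, and types are only suggestions:

     CREATE TABLE system_user(
         user_id        SERIAL PRIMARY KEY,
         user_name      VARCHAR(64)  NOT NULL UNIQUE,
         password_hash  CHAR(60)     NOT NULL,  -- e.g., a bcrypt hash
         user_role      VARCHAR(16)  NOT NULL
             CHECK (user_role IN ('food_admin', 'food_user', 'food_researcher'))
     );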

Data Requirements

The following requirements hold:

Security Requirements

The following requirements hold:


3. Implementation Notes

This section describes a number of issues related to the implementation, and suggestions related to the design of the system. Multiple implementation strategies are possible; do not feel bound by the suggestions here.

Database Diagram

When designing your database, you may find it useful to use a diagramming tool. For the purposes of this lab you can create a (detailed) database diagram, write an SQL script in a text editor that defines all the tables and their relationships, or do both.

Note that it can be useful to create an abbreviated diagram that shows just the entities and the relationships between them (perhaps with attribute names). Although not detailed enough to create tables from, such a diagram can still help you organize your thoughts.

Food Data

It is educational (and more satisfying) to work with realistic data rather than creating a small set of "fake" records just for testing purposes. Also, importing and cleaning real data creates a number of interesting issues that are worth experiencing and learning about. The process of getting real data sets to fit into your structured database is sometimes called "data wrangling." It is a non-trivial activity!

The United States Department of Agriculture provides data on foods for free download. As a convenience I provide the complete collection of CSV data files from that site (October 2022 data).

You will first want to import the USDA data into temporary tables. You do not need to import every CSV file provided by the USDA, so focus only on the files that contain the data you need. Start by creating a suitable temporary table for each CSV file you want to import. I recommend creating attributes in the temporary table with exactly the same names as in the USDA file, and in the same order. This will make it easier to understand what is going on. You might want to use different attribute names in your final database, but to make the import process as clear as possible, I'd stick with the USDA names.

Try to create reasonable data types for the attributes. This will require some guesswork, especially when it comes to the length of strings. If you don't get it right, PostgreSQL will fail the import because of type constraint issues. In that case no data will be imported at all since the entire import operation is a single transaction. Make the types as specific as you can; it will simplify processing the data later. In other words, don't try to treat everything as a string when some attributes are clearly numeric.

Ultimately you won't need all the attributes in the USDA tables, but import them anyway for this step. There are a couple of reasons for this:

  1. You might not know ahead of time exactly what you will need. It's easier to import everything up front than to try to alter tables later.

  2. If the CSV file has more fields than your table does, PostgreSQL will report errors and fail the import.

It is possible to define your table with the attributes in a different order than in the CSV file and then control which CSV field goes into which attribute. This would just be confusing, though, and complicate the import for no good reason.

To illustrate, here is my USDANutrient table that is intended to receive the data from the file nutrient.csv:

     CREATE TABLE USDANutrient(
         id            INT PRIMARY KEY,
         name          VARCHAR(128),
         unit_name     CHAR(7),
         nutrient_nbr  NUMERIC(4,1),
         rank          NUMERIC(7,1)
     );

As always, define a primary key. The USDA files usually have fairly obvious key fields.

The importing process is tricky, and I recommend using the full syntax of the \copy command rather than one of its abbreviated forms. You will also want to do this on Lemuria using the psql program rather than through DBeaver. The size of the data files is one reason... you don't want to drag the huge files over the network more than once if you can avoid it.

Here is the command I used to import that nutrient.csv file:

     pccFoodTracker=> \copy USDANutrient FROM 'nutrient.csv' WITH
       (FORCE_NULL(nutrient_nbr,rank), DELIMITER ',', FORMAT CSV, HEADER)

A few notes:

  1. Although I show the command above on two lines for clarity, you will need to type one long line. I don't know how to do line continuations for the backslash commands.
  2. The options are a parentheses-enclosed, comma-separated list following the word WITH.
  3. You do need to use both DELIMITER ',' and FORMAT CSV. The FORMAT option turns on the special quote handling (i.e., fields with embedded commas are handled properly).
  4. The HEADER option tells PostgreSQL to ignore the first line.
  5. The FORCE_NULL option is given a comma-separated list of attribute names. If an empty string is seen for the value of those attributes, it will be interpreted as NULL.

The last point above needs elaboration: The issue is with CSV fields that look like empty strings (""). Normally, PostgreSQL will interpret those fields as strings with zero length. That is meaningful (although maybe not what you want) for CHAR and VARCHAR attributes. However, for NUMERIC attributes it is an error because an empty string is not a number. PostgreSQL will not treat such values as zero. After all, zero is a perfectly meaningful number, and might not be an appropriate choice.

The purpose of FORCE_NULL is to get PostgreSQL to treat fields with "" as their value as NULL values. You will want this. Unfortunately you do need to tediously list all attributes where you want this treatment. As far as I know, there is no way to do it globally for the entire table.

It turns out there is a record on line 725844 of the branded_foods.csv file that shows a serving size of 2.56e+8 grams (i.e., 256 million grams). I discovered this because during the import of that file I received the following message:

     psql:temp.sql:1: ERROR: numeric field overflow
     DETAIL: A field with precision 6, scale 2 must round to an absolute value less than 10^4.
     CONTEXT: COPY usdabrandedfood, line 725844, column serving_size: "2.56113168E8"

This serving size is clearly an error. Instead of modifying my table to hold this ridiculous value, I just deleted that record from branded_foods.csv.

Importing USDA Data

When moving data from the USDA tables into the tables for this application, one is naturally led to executing SQL statements such as:

     UPDATE food
     SET Na = (SELECT amount
               FROM   usdafoodnutrient
               WHERE  food.id = usdafoodnutrient.fdc_id AND nutrient_id = 1093);

This statement is trying to set the sodium (with chemical symbol Na) values in the food table. The nutrient ID for sodium is 1093 as shown in the USDA nutrient table.

This statement is, in effect, two nested loops. First, every record in the (large) food table is modified. For each of those records, a SELECT is done in the (even larger) food nutrient table. Furthermore, that SELECT is making use of the fdc_id value, which is not a primary key. As a result, the search of the food nutrient table is quite slow and the entire statement takes an inordinately long time to execute.

To work around this you should first create a supplementary index on the fdc_id attribute in the food nutrient table.

     CREATE INDEX ON usdafoodnutrient(fdc_id);

Creating this index takes a couple of minutes, but it greatly accelerates the rate at which fdc_id values can be located in the food nutrient table. Note that primary keys are always indexed, but since fdc_id is not a primary key it needs its own index in this situation.

Even with the index, the UPDATE command still takes a few minutes to execute. However, this is much, much faster than it would have taken otherwise.
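
If you are curious whether the index is actually being used, EXPLAIN will display the plan PostgreSQL intends to use without executing the statement (the exact output will vary):

     EXPLAIN
     UPDATE food
     SET Na = (SELECT amount
               FROM   usdafoodnutrient
               WHERE  food.id = usdafoodnutrient.fdc_id AND nutrient_id = 1093);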

Users

Some of the security requirements can only be implemented by way of stored procedures with the SECURITY DEFINER attribute. Those procedures would be created by the owner of the database itself (you), and made executable by the appropriate role. Because the procedures will run with the authority of the database owner, they will be able to access all tables and data objects. Yet they can limit what their invoker actually does.

This approach is necessary to implement some (or even all) of the security requirements because some tables will naturally hold data for multiple users who aren't supposed to be able to see each other's data. For example, the recipe table will (likely) hold information about all recipes. Yet a user should only be able to see their own recipes. If the food_user role has SELECT access on the recipe table, users would be able to see each other's recipes. The solution is to deny SELECT access to that table for food_user, but grant EXECUTE access to a stored procedure that looks up only the records pertaining to a specific user.
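
One possible shape for such a procedure is sketched below; the recipe table layout, the column names, and the function name are assumptions, not requirements:

     -- Returns only the recipes owned by the given user. Because the function
     -- is SECURITY DEFINER, it can read the recipe table even though the
     -- food_user role has no SELECT privilege on that table.
     CREATE FUNCTION my_recipes(p_owner_id INT)
         RETURNS TABLE(recipe_id INT, recipe_name VARCHAR, servings INT) AS $$
         SELECT recipe.recipe_id, recipe.recipe_name, recipe.servings
         FROM   recipe
         WHERE  recipe.owner_id = p_owner_id
     $$ LANGUAGE SQL SECURITY DEFINER;

     REVOKE EXECUTE ON FUNCTION my_recipes(INT) FROM PUBLIC;
     GRANT  EXECUTE ON FUNCTION my_recipes(INT) TO food_user;

The web application would then SET ROLE food_user and call, for example, SELECT * FROM my_recipes(42) using the ID of the currently logged-in user.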


4. Deliverables

This lab is fairly complicated. In this section I enumerate the deliverables I would like you to hand in.

Submit your documents to Canvas, in a zip archive if there is more than one document. This lab is worth 40 points.


Last Revised: 2023-04-04
© Copyright 2023 by Peter C. Chapin <pchapin@vtc.edu>