Due: Friday, 2023-04-28
Reading: [TBD]
In this lab you will design and implement a database that allows users to track the food they eat. Recording this information is useful for users who want to manage their weight or other health issues such as blood pressure or calcium intake. This document outlines the requirements of the system and discusses some implementation issues. Certain design details are left up to you.
Complex systems are often implemented in three tiers. Clients typically use a web browser (or mobile app) to access the system. JavaScript executing in the browser itself can provide some of the system's functionality. The middle tier is the web server or application server which executes a web application in PHP, Java, Python, or some other language. That program commonly accesses a database. The database may also include some stored procedures that provide some of the system's functionality. The web application executes the "business logic" of the overall application while the database manages the data storage.
One of the central design decisions that must be made when designing an overall system is how the system's functionality will be divided across the three tiers. Often there are many choices for how to do this, with each choice having its own set of advantages and disadvantages. Since this lab is part of a database course, we will make choices that emphasize the database. This is a good approach for a situation where the database has many clients besides just a web application (e.g., a desktop application or a mobile application that communicates directly with the database). Access control and other control logic implemented in the database is automatically available to all clients without having to replicate that logic in each client.
We will make use of the USDA food nutrient data. This data is extensive and comes as several CSV files. The zip archive also contains a PDF document that describes the data and how it is organized. We will not need to use all of this data!. Also be sure the organization of this data does not bias your thinking when you design the database for this assignment. You want to adjust the data to fit your design, not adjust your design to fit the data.
In this section the requirements of the system as a whole are enumerated.
There are three user roles as described below.
Administrator (food_admin). Administrators are able to manage the list of system users and also update the food database. They may also need to perform maintenance actions on other aspects of the database such as correcting/removing bad recipes (to name one example). In effect, administrators need to have full access to every aspect of the database.
Ordinary (food_user). Ordinary Users track their food. They need to be able to record what they eat as well as input named recipes containing ingredients lists (with amounts) and number of servings.
Researcher (food_researcher). Researchers want to access aggregated information that is stripped of identifying data (to protect the privacy of ordinary users) to research trends, correlations between foods and medical outcomes, and similar things.
For this lab, assume that the web application connects to the database as yourself, in your role as the system developer and owner of all database objects. Thus, the web application has full access to everything. Maintain an ordinary table in the database about all the users of your system and their roles as described above. When one of the users logs into the web application, the web application issues a SET ROLE command before interacting with the database on behalf of that user. Database privileges should be adjusted as appropriate to ensure the security requirements are enforced to the greatest degree feasible.
Some security requirements might need help from stored procedures or functions that have execute privilege set for the appropriate roles, but that are created as SECURITY DEFINER procedures. Such procedures have the authority of their owner when they execute and thus, in this situation, are able to access any database object in any way necessary.
Some other security requirements might need help from the web application. For example, users should only be able to see their own recipes. However, if a hypothetical recipe table is SELECTable by food_user, then, in theory, every user could see everyone else's recipe. The web application should still set its role to food_user before selecting from that table, but then only SELECT for the user that is currently logged in.
This approach requires that the database track a possibly large list of users along with their roles and hashed passwords. The web application will use that information to allow users to log in. Note that each user has a single role. If an individual uses the system as both an administrator and an ordinary user, for example, that person would have two distinct user accounts.
The following requirements hold:
Data.UserContact. Basic contact information shall be recorded for every user of the system. This information shall include: username (which shall be unique), hashed password, role, full name, and email address. The email address is optional.
Data.UserData. For ordinary users, it shall be possible to record that user's height (inches), weight (pounds), blood pressure (both systolic and diastolic in mmHG), and blood glucose level (mg/dL) on at least a daily basis. For blood pressure and blood glucose level measurements it shall also be possible to record the time of the measurement. The user data for a particular user is considered "owned" by that user. This is true even if the data is entered or modified by an administrator. One essential difference between user data and user contact information is that historical records of user data must be kept. Users might want to view trends in their weight or blood pressure, for example. However, only the most current contact information needs to be retained.
Data.Food. The food database contains information on basic nutrients of various foods. Each food is represented by a human-friendly name. The nutritional information recorded includes kcals, sugar, fat, carbohydrates, protein, sodium, calcium, iron, and vitamins A, C, and D. It is permitted for information on some or all of these nutrients to be missing. A corresponding "serving size" must be stored to provide context for the nutrient quantities. The units used to measure the serving size must also be stored. Food information is not owned by any ordinary user. Information about a food, including the human-friendly name, should be modifiable by an administrator without disrupting diary entries that use that food.
Data.Recipes. For our purposes recipes consists of lists of ingredients, with amounts (given in terms of the serving size and units used in the food data), and a count of the number of servings in the recipe. Each recipe has a name and a single owner who is allowed to view, update, or use that recipe. If a recipe is changed, there are two possibilities: 1) the change is assumed to affect all previous uses of the recipe. This is appropriate for changes that are corrections, and 2) the change only affects new uses of the recipe. This is appropriate for real changes in the way the recipe is prepared. Recipes cannot be deleted unless they are not being used in any diary entry.
Data.Diary. The diary stores information about foods consumed. Each entry is for a specific (food, amount) pair, and is owned by a particular user and recorded for a particular date. The "amount" component is expressed in terms of the service size and units used in the food data. The "food" component can also refer to a recipe. In that case the "amount" component is interpreted as the number of servings of that recipe.
Data.BMI. The database provides BMI as a computed attribute based on the height and weight information recorded for a user. There shall be a stored function that computes the BMI given a weight and a height (it is a very short function). There shall be a view available that shows BMI information along with user's weight and height data. [Creating a stored function to do this is overkill as the expression to compute BMI is quite simple and could be part of the view definition. However, I want you to get more practice with stored function syntax.]
The following requirements hold:
Security.Roles. Users with the Administrative role have full access to all data in the system. Users with the Researcher role only have access to aggregated user data that is stripped of all individual identifying information. Users with the Ordinary role can only access data "owned" by them or view information on the foods. Specifically, they can read or update their personal data and recipes. Ordinary users cannot read data owned by any other ordinary user.
This section describes a number of issues related to the implementation, and suggestions related to the design of the system. Multiple implementation strategies are possible; do not feel bound by the suggestions here.
When designing your database, you may find it useful to use a diagramming tool. For purposes of this lab you can either create a (detailed) database diagram, or you can write an SQL script in a text editor that defines all the tables and their relationships, or both.
Note that it can be useful to create an abbreviated diagram that just shows entities, and the relationships between them (perhaps with attribute names). Although not detailed enough to create tables from, such a diagram can still help you organize your thoughts.
It is educational (and more satisfying) to work with realistic data rather than creating a small set of "fake" records just for testing purposes. Also, importing and cleaning real data creates a number of interesting issues that are worth experiencing and learning about. The process of getting real data sets to fit into your structured database is sometimes called "data wrangling." It is a non-trivial activity!
The United States Department of Agriculture provides data on foods for free download. As a convenience I provide the complete collection of CSV data files from that site (October 2022 data).
You will want to first import the USDA data into temporary tables. You do not need to import every CSV file provided by the USDA, so only focus on the files that contain the data you need. Start by creating a suitable temporary table for each CSV file you want to import. I recommend creating attributes in the temporary table with exactly the same names as in the USDA file, and in the same order. This will make it easier to understand what is going on. You might want to use different attribute names in your final database, but to make the import process as clear as possible, I'd stick with the USDA names.
Try to create reasonable data types for the attributes. This will require some guesswork, especially when it comes to the length of strings. If you don't get it right, PostgreSQL will fail the import because of type constraint issues. In that case no data will be imported at all since the entire import operation is a single transaction. Make the types as specific as you can; it will simplify processing the data later. In other words, don't try to treat everything as a string when some attributes are clearly numeric.
Ultimately you won't need all the attributes in the USDA tables, but import them anyway for this step. There are a couple of reasons for this: 1. You might not know ahead of time exactly what you will need. It's easier to just import everything up front than to try and alter tables in the future. 2. If the CSV file has more fields than your table does, PostgreSQL will report errors and fail the import.
It is possible to define your table with the attributes in a different order than in the CSV file and then control which CSV field goes into which attribute. This would just be confusing, though, and complicate the import for no good reason.
To illustrate, here is my USDANutrient table that is intended to receive the data from the file nutrient.csv:
CREATE TABLE USDANutrient( id INT PRIMARY KEY, name VARCHAR(128), unit_name CHAR(7), nutrient_nbr NUMERIC(4,1), rank NUMERIC(7,1) );
As always, define a primary key. The USDA files usually have fairly obvious key fields.
The importing process is tricky, and I recommend using the full syntax of the \copy command rather than one of its abbreviated forms. You will also want to do this on Lemuria using the psql program rather than through DBeaver. The size of the data files is one reason... you don't want to drag the huge files over the network more than once if you can avoid it.
Here is the command I used to import that nutrient.csv file:
pccFoodTracker=> \copy USDANutrient FROM 'nutrient.csv' WITH (FORCE_NULL(nutrient_nbr,rank), DELIMITER ',', FORMAT CSV, HEADER)
A few notes:
The last point above needs elaboration: The issue is with CSV fields that look like empty strings (""). Normally, PostgreSQL will interpret those fields as strings with zero length. That is meaningful (although maybe not what you want) for CHAR and VARCHAR attributes. However, for NUMERIC attributes it is an error because an empty string is not a number. PostgreSQL will not treat such values as zero. After all, zero is a perfectly meaningful number, and might not be an appropriate choice.
The purpose of FORCE_NULL is to get PostgreSQL to treat fields with "" as their value as NULL values. You will want this. Unfortunately you do need to tediously list all attributes where you want this treatment. As far as I know, there is no way to do it globally for the entire table.
It turns out there is a record on line 725844 of the branded_foods.csv file that shows a serving size of 2.56e+8 grams (i.e., 256 million grams). I discovered this because during the import of that file I received the following message:
psql:temp.sql:1: ERROR: numeric field overflow DETAIL: A field with precision 6, scale 2 must round to an absolute value less than 10^4. CONTEXT: COPY usdabrandedfood, line 725844, column serving_size: "2.56113168E8"
This serving size is clearly an error. Instead of modifying my table to hold this ridiculous value, I just deleted that record from branded_foods.csv.
When moving data from the USDA tables into the tables for this application, one is naturally led to executing SQL statements such as:
UPDATE food SET Na = (SELECT amount FROM usdafoodnutrient WHERE food.id = usdafoodnutrient.fdc_id AND nutrient_id = 1093);
This statement is trying to set the sodium (with chemical symbol Na) values in the food table. The nutrient ID for sodium is 1093 as shown in the USDA nutrient table.
This statement is, in effect, two nested loops. First, every record in the (large) food table is modified. For each of those records, a SELECT is done in the (even larger) food nutrient table. Furthermore, that SELECT is making use of the fdc_id value, which is not a primary key. As a result, the search of the food nutrient table is quite slow and the entire statement takes an inordinately long time to execute.
To work around this you should first create a supplementary index on the fdc_id attribute in the food nutrient table.
CREATE INDEX ON usdafoodnutrient(fdc_id);
Creating this index takes a couple of minutes, but it greatly accelerates the rate at which fdc_id values can be located in the food nutrient table. Note that primary keys are always indexed, but since fdc_id is not a primary key it needs its own index in this situation.
Even with the index, the UPDATE command still takes a few minutes to execute. However, this is much, much faster than it would have taken otherwise.
Some of the security requirements can only be implemented by way of stored procedures with the SECURITY DEFINER attribute. Those procedures would be created by the owner of the database itself (you), and made executable by the appropriate role. Because the procedures will run with the authority of the database owner, they will be able to access all tables and data objects. Yet they can limit what their invoker actually does.
This approach is necessary to implement some (or even all) of the security requirements because some tables will naturally share data for multiple users who aren't supposed to be able to see each other's data. For example, the recipe table will (likely) hold information about all recipes. Yet a user should only be able to see their own recipes. If the food_user role has SELECT access on the recipe table, users would be able to see each other's recipes. The solution is to deny SELECT access to that table for food_users, but grant EXECUTE access to a stored procedure that looks up only the records pertaining to a specific user.
This lab is fairly complicated. In this section I enumerate the deliverables I would like you to hand in.
A description of the relations in the database. This includes EITHER a detailed diagram of its design OR an SQL script with suitable CREATE TABLE statements (or both, if appropriate). You should also include a document that gives an overview of your design and some comments about how well normalized it is or isn't. Your documentation should also include a view definition for the BMI information. Include the source of the stored function you created to compute BMI.
A document (could be part of the first document) that describes your experience with importing the USDA database. Include any queries or other methods you used to transform the raw USDA data into whatever form you used in your tables. Be sure to comment on any data anomalies that you encountered in the importation process.
Submit your documents to Canvas, in a zip archive if there is more than one document. This lab is worth 40 points.
Last Revised: 2023-04-04
© Copyright 2023 by Peter C. Chapin <pchapin@vtc.edu>