A Brief Intro to Databases

"A database" = a system to store data plus a standardized way to access and manipulate those data.

Allows you to store data and get it or modify it when you need to easily and efficiently regardless of the amount of data being manipulated. (Database Journal article, see references)

In the context of the web, allows for dynamic page generation and modification. Example -- Amazon.com -- how would they create and keep all those webpages up to date? Every page is generated on the fly. Example, imaginary toptenbooks.html with pseudocode:

<html>

<head> <title>Stuff</title> </head> <body> <h1>Welcome to Amazon</h1> <p> [ retrieve(allBooks) from (allBooksDatabase) order by (numberSold) limit (10) and put into array ($toptenbooks) ] <ol> <li>$toptenbooks[0] <li>$toptenbooks[1] <li>$toptenbooks[2] ...etc </ol> </p> </body> </html>

They never have to change the code of this actual toptenbooks.html page -- they only need to update what's in the allBooks database, and the page content will change accordingly. More on this later.

Databases and Spreadsheets

It helps here to discuss spreadsheets. Example, a simple video games list:

This is fine for a relatively small data set. You can sort by column, and use "find" to jump to certain records. But what if we had *hundreds* of games? Or thousands? This spreadsheet document would get huge. You'd be scrolling all over the place, it would take up more space on your hard drive, etc. You also have a lot of redundancy here, i.e. many records have the same value ("platform") which takes up space (however many bytes to store the word over and over).

When you have a LOT of records, you need a way to show *just* the records you want to see. And how would you integrate this with a web page? Export the spreadsheet to HTML every time you add a new game? What a pain. And what about this redundancy problem (the word "Nintendo" over and over)?

Enter MySQL. SQL = "Structured Query Language" and is the underlying data access and manipulation "language" (really part programming language, part operating system) for most of the world's databases. Oracle, Microsoft Access, even modern versions of FileMaker use SQL. MySQL is a popular open-source implementation of SQL used for many medium-large data sets and has especially good support for web apps.

Terminology

More analogies:

MySQL :: database server with multiple databases

NeoOffice :: folder full of spreadsheets

MySQL database :: multiple tables

NeoOffice spreadsheet :: multiple worksheets (sort of)

MySQL database table :: multiple fields

NeoOffice spreadsheet :: multiple columns or rows

Summary:

    • One MySQL server can have many databases.

    • Each database can have many tables.

    • Each table can have many fields.

    • Each field contains actual data, be it human-readable information, or a "key" (pointer) to another field, possibly in another table, but NOT in another database.

SQL

It's difficult to talk about SQL syntax if you don't already have some data in an SQL database; but, paradoxically, it's difficult to get data into an SQL database without using some SQL. So, we'll make do with an existing database, and skip the creation steps for now. Example: A MySQL database called "videogames".

This is Asteroids on the Atari. This was the coolest thing ever in 1978.

The "videogames" database has two tables: games and platforms. The MySQL commands "SHOW" and "DESCRIBE" are key to telling you about your database(s).

mysql> show tables; +----------------------+ | Tables_in_videogames | +----------------------+ | games | | platforms | +----------------------+ 2 rows in set (0.00 sec)

The "games" table has four fields: a unique ID, title, comments, and platform.

mysql> describe games; +----------+--------------+------+-----+---------+----------------+ | Field | Type | Null | Key | Default | Extra | +----------+--------------+------+-----+---------+----------------+ | id | int(11) | | PRI | NULL | auto_increment | | title | varchar(255) | YES | | NULL | | | comments | text | YES | | NULL | | | platform | tinyint(4) | YES | | NULL | | +----------+--------------+------+-----+---------+----------------+ 4 rows in set (0.00 sec)

Note that each field has a "type". (Ignore the other details for now.) They're somewhat obvious -- int = integer, varchar = variable characters (text or numbers) with a maximum length of 255 characters, text = like varchar but bigger. Look at platform, though. Why is it a "tinyint" instead of a text or a varchar? We'll get to that.

And the platforms table has just two fields, an id and a name.

mysql> describe platforms; +-------+--------------+------+-----+---------+----------------+ | Field | Type | Null | Key | Default | Extra | +-------+--------------+------+-----+---------+----------------+ | id | int(11) | | PRI | NULL | auto_increment | | name | varchar(255) | YES | | NULL | | +-------+--------------+------+-----+---------+----------------+ 2 rows in set (0.00 sec)

SELECTing

Show records of a table:

mysql> select * from games limit 10; +----+-----------------------------------------+------------------------------------------------------+----------+ | id | title | comments | platform | +----+-----------------------------------------+------------------------------------------------------+----------+ | 1 | Adventure | NULL | 1 | | 2 | Adventures of Tron | NULL | 1 | | 3 | Air Sea Battle | uppercase end label; aka air-sea battle | 1 | | 4 | Alpha Beam with Ernie | NULL | 1 | | 5 | Arcade Pinball | Sears Telegames picture label, same as Video Pinball | 1 | | 6 | Armor Ambush | NULL | 1 | | 7 | Artillery Duel/Chuck Norris Super Kicks | Xonox double ender | 1 | | 8 | Artillery Duel/Spike's Peak | Xonox double ender | 1 | | 9 | Asteroids | NULL | 1 | | 10 | Astroblast | NULL | 1 | +----+-----------------------------------------+------------------------------------------------------+----------+ 10 rows in set (0.00 sec)

LIMIT 10 tells it to just grab the first ten records. Without that clause, it would scroll by very quickly and show us all the records.

Let's be more specific about which records to show:

mysql> SELECT * FROM games WHERE title = "Asteroids"; +-----+-----------+----------+----------+ | id | title | comments | platform | +-----+-----------+----------+----------+ | 9 | Asteroids | NULL | 1 | | 211 | Asteroids | | 7 | +-----+-----------+----------+----------+ 2 rows in set (0.00 sec)

One with platform "1" and one with "7", hmmm...

JOIN

Why is "platform" a number? This goes back to the question about the TYPE being a tinyint, and it's related to the redundancy factor we mentioned earlier. Rather than take up the space to enter "Nintendo" or "Atari" over and over and over again, we just use a number -- an integer smaller than 255 -- which corresponds to a textual entry in the platforms table. This takes up a minimal amount of room in the database, since we only make 1 long text entry for each platform -- the only thing that gets repeated is a single-character integer. Under this scheme, when we want to see all the games for a particular platform, we do a JOIN (fancy database word for a special method of sorting data across multiple tables).

mysql> select games.title, games.comments, platforms.name from games, platforms where games.platform = platforms.id limit 10; +-----------------------------------------+------------------------------------------------------+------------+ | title | comments | name | +-----------------------------------------+------------------------------------------------------+------------+ | Adventure | NULL | Atari 2600 | | Adventures of Tron | NULL | Atari 2600 | | Air Sea Battle | uppercase end label; aka air-sea battle | Atari 2600 | | Alpha Beam with Ernie | NULL | Atari 2600 | | Arcade Pinball | Sears Telegames picture label, same as Video Pinball | Atari 2600 | | Armor Ambush | NULL | Atari 2600 | | Artillery Duel/Chuck Norris Super Kicks | Xonox double ender | Atari 2600 | | Artillery Duel/Spike's Peak | Xonox double ender | Atari 2600 | | Asteroids | NULL | Atari 2600 | | Astroblast | NULL | Atari 2600 | +-----------------------------------------+------------------------------------------------------+------------+ 10 rows in set (0.00 sec)

Note that when your queries get complicated, it helps to break them up onto separate lines, and use capital letters for commands vs. lowercase for tables/fields. Let's show the result of this JOIN with a WHERE clause for just Asteroids:

mysql> SELECT games.title, games.comments, platforms.name FROM games, platforms WHERE games.platform = platforms.id AND games.title = "Asteroids" LIMIT 10; +-----------+----------+------------+ | title | comments | name | +-----------+----------+------------+ | Asteroids | NULL | Atari 2600 | | Asteroids | | Atari 7800 | +-----------+----------+------------+ 2 rows in set (0.00 sec)

More complex queries actually make use of the word JOIN and other keywords to further hone the results, but we're not going to go there.

Databases and the web

What does this have to do with web pages?

http://www.whoopis.com/videogames/

This lets me view and modify the web content without ever touching the HTML or uploading anything, because all I do is work with the database contents.

Example: Mixing HTML with variables and function calls which rely on queries to the database. Here's the source code:

echo "<p>There are currently " . countEntries() . " games in the $dbname database.";

countEntries() is a function which, when called, executes the following query:

SELECT count(*) FROM games;

Now I add a game. What happens when I refresh the web page? And what does it look like when you view the source in your browser?

This code is executed on the server, the query is made to the database, and the results are displayed back to your web browser as plain old HTML.

Here's the block of code that gets used when you do a search for a game:

$searchterm = $args[1]; $sql = "SELECT games.id, games.title, games.comments, platforms.name FROM games, platforms WHERE platforms.id = games.platform AND (games.title LIKE '%$searchterm%' OR games.comments LIKE '%$searchterm%') ORDER BY games.title ";

This is the JOIN we looked at earlier, combined with a search using wildcards (the percent sign is a wildcard in SQL -- used here, it lets us search on part of a name), and the search term can occur in the game title OR in the comments field. We then sort (ORDER BY) the title, so the records appear alphabetized in the web page.

Relationality, Scalability, Data Types, etc.

The scheme where you separate out the redundant data into another table and tie the two together with an ID, is what makes a database relational and it is where the real power lies when you're working with a data set with records in the tens of thousands or millions. The example that Ms. Patten and I work with every day is our district's Student Information System -- a very relational database (dozens of tables) for tracking everything about students. Think about the quantity of data here:

1 student * 4 years in high school * 12 courses per year (just a guess) * 4 quarters (grading periods) per course = 1 * 4 * 12 * 4 = 196 individual records of final grades for just that one student. We're not even counting individual assignment grades or progress reports (which are also stored in the database). Then there's attendance -- per period, per day, per year -- you can see how fast this might grow. If we used a spreadsheet to hold all of that, our computers would burst into flames every time we opened the file.

How is it actually stored on disk? In an efficient binary format. The size on disk of the database files is determined largely by the data types of your fields,so when you're designing a database, it's good to sketch it out first and think about what kind of data each field is going to hold, and how much. Example -- we learned the hard way at my old job that a varchar(255) is NOT big enough for a lot of modern URLs, and we lost some data due to truncation.

Also, SQL has a LOT of data types beyond what we've discussed -- for example, a DATETIME field which stores a date and time in a structured format that allows you to do date arithmetic. E.g. "show me all records where the date is earlier than 10/2/2008" -- SQL doesn't know that "September" is earlier than "December", those are just strings of letters as far as it's concerned. (In fact, it would think that December is "earlier" than September because D comes before S alphabetically.) But if you use the DATETIME type, SQL will be able to accurately search and sort by date, even taking into account leap years and days of the week and everything.

Now think about: Wikipedia, Facebook, MySpace, etc. not to mention Amazon, eBay, and Gmail. Millions (more like billions) of web pages -- there's no way they could exist in this form, with all the instantaneous updates through your browser and all the interlinked data and searchability etc. without databases. It's why they can revert vandalism instantly on Wikipedia, why you can create a blog page that wasn't there a second ago just by clicking a button in your browser, why Gmail lets you search your email so fast. Moral of the story: Databases are really boring, but really important.

The end!

References

Wikipedia article on SQL.

http://en.wikipedia.org/wiki/SQL

My very basic overview of setting up and using MySQL. Light on theory.

http://www.whoopis.com/howtos/mysqlbasics.html

A tutorial on MySQL and PHP, like what the videogames database and web front-end use.

http://www.php-mysql-tutorial.com/

An in-depth article on databases in general with some web-app specifics. Heavy on theory.

http://www.databasejournal.com/sqletc/article.php/1428721/Introduction-to-Databases-for-the-Web-Pt-1.htm