main
May 9th, 2012    

CISC 7512X 3.0 2112 W6
Main
Files
Syllabus
Links
Homeworks

Notes
0001

PDFs,etc
DB Design
Oracle Primer
SQL Intro
More SQL
Data Loads
AnalyticFuncs
DB Procs
Indexes/Joins
Hierarchical
Partitions
Security
Dist DBs
DB Speed
Data Mining I
Data Mining II
Bayes classifier src
Data Mining III

PHP (src code)
php src | php gen
C# DB (src code)
More C# DB (src code)
Code Gen
Java DB (src code)

notes_20120215

Sample Data
Stock Ordrs
ctsdata.20120222.zip

SQLRunner

Homeworks

HW1: Email me your name, major (MS/MA?), and 1 sentence telling me how comfortable you are with SQL. Please include "CISC 7512X HW1" in email subject.


HW2: Install OracleXE. I suggest you set it up under its own "OS" (by first installing VirtualBox, or VMware, etc.).

As a `simple' review of SQL, do `Sample Questions' at the end of: sql2.pdf; For the same database, also answer the following questions:

1) Find the company with most employees.

2) Find employees who make more than the average salary within their company.

3) Find employees who make more than the median salary within their company.

4) Find employees whose salary is an outlier (above 2 standard deviations) within their comapny.

5) Find employees whose salary is an outlier (above 95th percentile) within their comapny.

6) Find the company with the highest number of outlying salaries (your choice which outlier to use).

7) Find the company with most non-managing employees.

8) Find the company with highest average difference between manager salary and non-manager employee salary.

9) Assume that each non-managing employee genererates around 2x their salary in revenue. Managing employees don't directly contribute to revenue. Estimate revenue and ``profit'' for each company (assume profit = revenue - all_salaries).

10) Calculate salary skew for profitable (profit > 0) companies from question 9.

Write answers in an email; put "CISC 7512X HW2" in email subject.


HW3: Load the ``Stock Ordrs'' data (ie: stockdata.zip) file into Oracle (using sqlldr), and Postgres. There are two files in the data, "ordr", and "cxl", which represent stock orders and stock order cancelations. The column details for these are in the file (come up with most sensible data-types you feel like).

Write down commands, scripts, whatever you've used to load the data into an email; put "CISC 7512X HW3" in email subject.


HW# 4 (this is more of a `project'; lets see whether anyone finishes this by next week; ultimate due date: last day of class): Load the ``Stock Ordrs'' data (ie: stockdata.zip) file into a database of your choice. Oracle recommended. There are two files in the data, "ordr", and "cxl", which represent stock orders and stock order cancelations. The column details for these are in the file. The file also has an `explanation.txt' file; which has a short example of how to pick out trades/quotes from the data.

Using select SQL query (no PL/SQL or T-SQL), create a "trades" table that represents trades between incoming orders. If the orders really came in at the times they came in, what would be the trading activity?

Obviously you'd do: create table trades as select ... from ...etc. and something similar for quotes.

Trade table columns: tdate symbol tim seq tid oid price qty, where tid is a unique trade id, oid is order id, seq is the sequence of the incoming order that's causing the trade, price is the price at which the trade took place (price of quote, usually), qty is the traded quantity for oid.

Similarly, create a quotes table, that represents a bid, bifqty, ofr, ofrqty (ie: bid and offer) at any given time (after a trade, or order arrival/cancelation).

Feel free to use analytical functions. Google for "oracle analytical functions".

If for some reason you are unable to do this homework, write a page or so explaining exactly what difficulty is preventing you from doing it. The explanation must clearly explain the problem in first paragraph (so we can maybe resolve it), and why none of the common SQL solutions resolve it. Email me if you're planning this route, maybe I can provide some hints.

Submit the SQL queries, and the first 1000 records for trades and quotes tables (note that you can compare the results, everyone in class should get identical output).

For the few adventurous students, you can modify the gendata.pl file (in stockdata.zip), to increase volume (set it to ~1000000; and set number of symbols to ~3000). That will generate relatively realistic data sizes---run your query on that data set.


HW# 5: Write a stored procedure, in PL/SQL or T-SQL, that accepts an argument N, and populates table N with the first N prime numbers. Prime numbers are integers that are 2 or greater, divisible only by themselves and 1. The N table schema is: create table N (N int).

Test your algorithm on numbers going upto 1000000 (anyone can write a stupid prime number check---try to write an efficient one :-).

Submit code for the stored procedure in an email; put "CISC 7512X HW5" in email subject.

EXTRA: In preparation for HW6, load all files in ctsdata.20120222.zip (link on the left) into Oracle or Postgres. The format of these files is: cts(date,symbol,open,high,low,close,volume), splits(date,symbol,post,pre), dividend(date,symbol,dividend).


HW# 6: load all files in ctsdata.20120222.zip (link on the left) into Oracle or Postgres. The format of these files is: cts(date,symbol,open,high,low,close,volume), splits(date,symbol,post,pre), dividend(date,symbol,dividend).

This homework has a few parts; Do not write procedural code (Java, C#, C/C++, etc.) for this homework (all code must be SQL, etc.). We'll only care about daily closing prices, dividends, and splits.

PART 1. Create another table with fields: tdate,symbol,prcnt which will have the daily percentage gain/loss adjusted for dividends and splits.

That is, if stock XYZ today (20120307) issued dividend of $0.25, previous day's (20120306) close was $50 and today's (20120307) close is $51, then today's percentage gain is: 1.5%, since 50 + 0.25 + 50*0.015 = 51. So your table will have: 20120307,XYZ,1.5
You'll need to sequence closing prices, dividends, and splits to account for "daily percentage gain/loss".

For part1: submit query used to construct the table.

PART 2. Background: Pairs trading. Using the percentage returns table you built in part 1: Your task is to identify potential symbol pairs that have high correlation, and are suitable for pairs trading. While everyone agrees that this strategy works, nobody agrees on the best way to identify correlation---especially when considered in relation to the rest of the market.

For this homework, feel free to use whatever you think is appropriate for correlation (if not sure, try Pearson; it's not really appropriate, but it's simple to calculate).

For part2: submit 10 "best" symbol pairs, each of which trades at least ~$10m a day, suitable for pairs trading in first week of February 2012. (your dataset ends 20120222, so use data prior to 20120201 to get pairs, and see if any of them prove lucrative during 20120201 - 20120222 time). Along with the pairs, submit their correlation coefficients for previous year, and the weeks of 20120201 - 20120222. (assume you were trading $1m worth, and you traded those exact 10 pairs, how much would you have gained/lost during that period?).

PART 3. Background: portfolio theory. The gist is that you can lower risk by investing in things that have *low* correlation. While a single stock will go up and down, the *average* returns from say 20 will likely be a lot steadier---provided of course that they're not all correlated (don't move in the same direction at the same time). These are the kinds of funds your retirement account is invested in.

For this homework, build a portfolio of 15 symbols, each from a different industry (use symbols.csv file), each of which has daily average volume over $10m (avg taken over last year), has paid dividends every year (no skipping) for the last 10 years, has returned at least a cumulative 3% a year for the last 15 years, and each of which have the lowest correlation with the rest of the symbols in your portfolio.

Calculate the return on those 15 symbols for 2011. Is that better or worse than S&P500? (how about last 10 years?).

For part3: submit 15 symbols, along with their aggregate 2011 return, compared to S&P500 return for the same time period.

Don't forget: write all code in SQL... let the database do the data crunching. Breakup large steps into smaller steps, using temp tables.


HW# 7: Part1: It's year 2032, and flying cars are finally here. For safety, each is equipped with a black box that every second logs: GPS timestamp, GPS latitude, longitude, and altitude.

To keep its flying registration, your flying car transmits this data nightly to FAA, so your safe flying record can be verified (or appropriate tickets issued if not).

Your job as a contractor for the FAA is to write queries that generate tickets. Your database gets 50 million car records per day (all flying cars in the US, in year 2032), each one logging their position every second for the entire day. Your task is:

1. Write a query that identifies speeders (anyone flying over 500mph).

2. Airborne flying cars must keep a distance of 50 feet away from any other airborne flying car. Write a query that identifies violators and tickets *both* cars [assume airborne means 50 feet altitude].

Note, millions of records, you can't just do an inner join and compare distances.

Part2: It's year 2012, and the city is strapped for cash. You notice license-plate reading cameras at highway entrances and exits, and get an idea... someone is taking a picture and logging the license plate in a database for every car that goes near those cameras. We get: GPS timestamp, GPS latitude and longitude of camera, and car license plate.

Your job as a DMV contractor: write a query to ticket top 5% of speeders, assuming max speed within the city (including highways) is 50mph.

Hint: If a car is photographed at location X, at 9:00:00am, and is then photographed again at some other random location Y, that is 1 mile away at 9:01:00am, that implies the car drove at 60mph (1-mile in 1-minute) between the two locations. Note that we don't care which road the driver used---unless they can bend space-time and teleport, they *had* to have gone at *least* 60mph between the two locations.

Note that part2 is somewhat different from part1... For both parts, assume your database is *huge* (millions of records---try to come up with solutions that are faster than O(n^2)).


HW# 8: Write a program to peform a database backup. Generate a public/private key pair (using GnuPG, or anything else). Your backup program/script must run daily, backup a table in a database into a .csv.gz file (comma delimited, gzip compressed [do not use database specific binary formats]). And encrypt the backup using your PUBLIC key. Note that you can use any language, database, utility, configuration, etc. (cron script or windows scheduler is ok). The key is that the backup is comma delimited, gzip compressed, AND encrypted with public key---and is generated daily. Email me the program/script and instructions on how to set it up to run daily (for cron, I want the crontab line, etc., for windows scheduler, I want a batch file to setup to run and instruction on how to set it up in scheduler)



































© 2006, Particle