This document is meant to provide context to the data team on our SQL query strategy, as well as to provide context for some manual work that would be helpful to verify that these queries are working as intended. This document can also serve as a point of reference for other engineers who could later optimize these queries or rewrite them.
The query is divided into smaller segments, which are joined together to form the final result. The goal is to return a table that gives us the total over/undersupply of graduates for each region.
+---------------------------------------------------------------------------------------+---------+
| region | total |
|---------------------------------------------------------------------------------------+---------|
| Three Rivers Region--Lafayette, Monroe, Itawamba, Chickasaw & Calhoun Counties | 568 |
| Hagerstown-Martinsburg, MD-WV | 397 |
| Isabella, Gratiot & Clare Counties | 75 |
| Northeast Vermont--Washington, Caledonia, Orleans, Lamoille & Essex Counties | 53 |
| East Central North Dakota--Jamestown City | 25 |
| Myrtle Beach-Conway-North Myrtle Beach, SC-NC | 17 |
| Deep East Texas COG (West) & Walker County | 5 |
| Rochester, MN | -7 |
| Michigan City-La Porte, IN | -12 |
| Waterloo-Cedar Falls, IA | -12 |
| Northwest Lower Peninsula (East) | -14 |
| Purchase Area Development District | -14 |
| Brown, Lyon, Redwood, Chippewa, Yellow Medicine, Lac qui Parle & Lincoln Counties | -16 |
| Utica-Rome, NY | -16 |
| Jackson, TN | -17 |
| Southwest Kansas--Dodge City, Garden City & Liberal City | -18 |
| Humboldt County | -19 |
| East Texas COG (Southwest)--Henderson & Anderson Counties | -20 |
| Pittsfield, MA | -20 |
| Georgia Mountains Regional Commission (Southeast) | -21 |
| Grants Pass, OR | -22 |
...
This table has two fields: the "total" column, which represents the over/undersupply of jobs to graduates in the job fields provided; and the "region" column, which represents the regions on the map. The query has one input, the fields of study, although it can be expanded to include other parameters.
The query can be divided into four distinct parts: First, we determine the demand for jobs in a particular region, for each field provided. Second, we get the cip_4
code for each of these fields. Third, we determine the supply of graduates for each region, for each of these fields, based on the cip_4
codes. Last, we join together the supply and demand results per region, by calculating the difference between the them per field and summing these differences.
This will give us the undersupply or oversupply of graduates for a given region (like the above table). We need to determine how we will calculate the "ratio" based on this figure. We may want to consider looking at normalizing these numbers for the overall population. We could also provide an overall ratio of supply versus demand, as we have talked about before.
These are the job categories that we can query on, based on the distinct possibilities in the labor data:
SELECT DISTINCT job_category_label FROM labor.data;
+------------------------------------------------+
| job_category_label |
|------------------------------------------------|
| Protective Service |
| Sales and Related |
| Production |
| Food Preparation and Serving Related |
| Military Specific |
| Healthcare Practitioners and Technical |
| Life, Physical, and Social Science |
| Transportation and Material Moving |
| Personal Care and Service |
| Education Instruction and Library |
| Computer and Mathematical |
| Arts, Design, Entertainment, Sports, and Media |
| Legal |
| Installation, Maintenance, and Repair |
| Farming, Fishing, and Forestry |
| Architecture and Engineering |
| Construction and Extraction |
| Building and Grounds Cleaning and Maintenance |
| Healthcare Support |
| Business and Financial Operations |
| Community and Social Service |
| Management |
| Office and Administrative Support |
+------------------------------------------------+
In this query, we are aggregating the total number of jobs available in each region, in each job category type, in the year 2019. For instance, if we were to run this query with Legal
and Computer and Mathematical
as the fields, we would get the following output:
SELECT job_category_label, SUM(count_2019) AS demand_2019, region
FROM labor.data
WHERE job_category_label IN ('Legal', 'Computer and Mathematical')
GROUP BY job_category_label, region
ORDER BY region, job_category_label;
+---------------------------+-------------+-------------+---------------------------------------------------------------------------------------+
| job_category_label | demand_2018 | demand_2019 | region |
|---------------------------+-------------+-------------+---------------------------------------------------------------------------------------|
| Computer and Mathematical | 1218 | 2283 | Abilene, TX |
| Legal | 1131 | 1043 | Abilene, TX |
| Computer and Mathematical | 241 | 328 | Acadiana Regional Development District 1--St. Landry & Evangeline Parishes |
| Legal | 538 | 359 | Acadiana Regional Development District 1--St. Landry & Evangeline Parishes |
| Computer and Mathematical | 666 | 481 | Adams, Pike, Brown, Schuyler & Mason Counties |
| Legal | 343 | 287 | Adams, Pike, Brown, Schuyler & Mason Counties |
| Computer and Mathematical | 12143 | 12808 | Akron, OH |
| Legal | 3308 | 2354 | Akron, OH |
| Computer and Mathematical | 1925 | 2408 | Albany, GA |
| Legal | 697 | 564 | Albany, GA |
| Computer and Mathematical | 3944 | 3726 | Albany, OR |
| Legal | 213 | 479 | Albany, OR |
| Computer and Mathematical | 18272 | 21768 | Albany-Schenectady-Troy, NY |
| Legal | 7514 | 8982 | Albany-Schenectady-Troy, NY |
| Computer and Mathematical | 12420 | 12937 | Albuquerque, NM |
| Legal | 7451 | 8343 | Albuquerque, NM |
| Computer and Mathematical | 770 | 1251 | Alexandria, LA |
| Legal | 1022 | 1067 | Alexandria, LA |
| Computer and Mathematical | 1094 | 60 | Allegan County |
| Legal | 147 | 244 | Allegan County |
| Computer and Mathematical | 12371 | 14193 | Allentown-Bethlehem-Easton, PA-NJ |
...
Next, we'll translate these job category types into CIP codes, using the crosswalk. This is necessary in order to query our c2c.field
data, because
each field contains a CIP code, but not a job category type. For instance, here are some of the cip_4
codes associated with the field of Architecture, according to our crosswalk.
SELECT DISTINCT cip_4, job_category_label FROM c2c.crosswalk WHERE job_category_label = 'Architecture and Engineering' ORDER BY cip_4;
+-------+------------------------------+
| cip_4 | job_category_label |
|-------+------------------------------|
| 51.23 | Architecture and Engineering |
| 47.06 | Architecture and Engineering |
| 46.04 | Architecture and Engineering |
| 45.07 | Architecture and Engineering |
| 43.04 | Architecture and Engineering |
| 29.04 | Architecture and Engineering |
| 29.03 | Architecture and Engineering |
| 29.02 | Architecture and Engineering |
| 26.1 | Architecture and Engineering |
| 15.17 | Architecture and Engineering |
| 15.16 | Architecture and Engineering |
| 15.15 | Architecture and Engineering |
| 15.14 | Architecture and Engineering |
| 15.13 | Architecture and Engineering |
| 15.12 | Architecture and Engineering |
...
These cip_4
codes will later be used to query our field data in order to calculate the total number of graduates in each field. The full SQL query for getting the cip_4
codes associated with Arts and Military fields is the following:
SELECT DISTINCT cip_4, job_category_label
FROM c2c.crosswalk
WHERE job_category_label IN ('Arts, Design, Entertainment, Sports, and Media', 'Military Specific')
This subquery is meant to return the total number of graduates for each field, in each region.
This subquery relies on the results of the second subquery, which will provide the cip
codes used to query the field data. It works by aggregating the total number of graduates in each region, for each job category provided. It does this by joining together three different tables:
- The field data, which contains the graduates (ipedscount2) for each specific area of study.
- The temporary cip_codes table from subquery #2, which lets us aggregate those specific areas of study by their field.
- The institution data, which lets us determine the region the graduates are from.
Here's the actual output of the full query, aggregated by region:
SELECT institution.region, cip_codes.job_category_label, SUM(field.ipedscount2) AS graduates
FROM c2c.field
JOIN c2c.institution ON institution.unitid = field.unitid
JOIN cip_codes ON cip_codes.cip_4 = field.cipcode_rf
WHERE
field.credlev_std IN ('BA', 'AS')
AND field.cipcode_rf IN (SELECT cipcode_rf FROM cip_codes)
AND field.ipedscount2 IS NOT NULL
GROUP BY institution.region, cip_codes.job_category_label ORDER BY graduates DESC
+---------------------------------------------------------------------------------------+---------------------------+-----------+
| region | job_category_label | graduates |
|---------------------------------------------------------------------------------------+---------------------------+-----------|
| New York-Newark-Jersey City, NY-NJ-PA | Computer and Mathematical | 39170 |
| Los Angeles-Long Beach-Anaheim, CA | Computer and Mathematical | 38229 |
| Boston-Cambridge-Newton, MA-NH | Computer and Mathematical | 21927 |
| Phoenix-Mesa-Scottsdale, AZ | Computer and Mathematical | 19164 |
| Washington-Arlington-Alexandria, DC-VA-MD-WV | Computer and Mathematical | 15345 |
| San Diego-Carlsbad, CA | Computer and Mathematical | 12961 |
| Chicago-Naperville-Elgin, IL-IN-WI | Computer and Mathematical | 12707 |
| San Francisco-Oakland-Hayward, CA | Computer and Mathematical | 11557 |
| Philadelphia-Camden-Wilmington, PA-NJ-DE-MD | Computer and Mathematical | 11283 |
| Atlanta-Sandy Springs-Roswell, GA | Computer and Mathematical | 9055 |
| Tampa-St. Petersburg-Clearwater, FL | Computer and Mathematical | 9023 |
| Salt Lake City, UT | Computer and Mathematical | 8663 |
| Miami-Fort Lauderdale-West Palm Beach, FL | Computer and Mathematical | 8629 |
| Minneapolis-St. Paul-Bloomington, MN-WI | Computer and Mathematical | 8551 |
| Riverside-San Bernardino-Ontario, CA | Computer and Mathematical | 8084 |
| Sacramento--Roseville--Arden-Arcade, CA | Computer and Mathematical | 8063 |
| Dallas-Fort Worth-Arlington, TX | Computer and Mathematical | 7480 |
| Seattle-Tacoma-Bellevue, WA | Computer and Mathematical | 6520 |
| Houston-The Woodlands-Sugar Land, TX | Computer and Mathematical | 6291 |
| San Jose-Sunnyvale-Santa Clara, CA | Computer and Mathematical | 5640 |
It's easier to understand this query before the data is aggregated. For instance, let's run a very similar query for Legal
and Computer and Mathematical
graduates, but let's not aggregate the results:
WITH cip_codes AS (
SELECT DISTINCT cip_4, job_category_label
FROM c2c.crosswalk
WHERE job_category_label IN ('Computer and Mathematical', 'Healthcare Support')
)
SELECT institution.region, institution.instnm, cip_codes.job_category_label, cip_codes.cip_4, field.ipedscount2 AS graduates
FROM c2c.field
JOIN c2c.institution ON institution.unitid = field.unitid
JOIN cip_codes ON cip_codes.cip_4 = field.cipcode_rf
WHERE
field.credlev_std IN ('BA', 'AS')
AND field.cipcode_rf IN (SELECT cipcode_rf FROM cip_codes)
AND field.ipedscount2 IS NOT NULL;
+---------------------------+-----------------------------------------+---------------------------+-------+-----------+
| region | instnm | job_category_label | cip_4 | graduates |
|---------------------------+-----------------------------------------+---------------------------+-------+-----------|
| Huntsville, AL | Alabama A & M University | Computer and Mathematical | 11.01 | 31 |
| Huntsville, AL | Alabama A & M University | Computer and Mathematical | 27.01 | 3 |
| Huntsville, AL | Alabama A & M University | Computer and Mathematical | 40.05 | 8 |
| Huntsville, AL | Alabama A & M University | Computer and Mathematical | 52.02 | 54 |
| Huntsville, AL | Alabama A & M University | Computer and Mathematical | 52.14 | 12 |
| Birmingham-Hoover, AL | University of Alabama at Birmingham | Computer and Mathematical | 11.01 | 42 |
| Birmingham-Hoover, AL | University of Alabama at Birmingham | Computer and Mathematical | 27.01 | 36 |
| Birmingham-Hoover, AL | University of Alabama at Birmingham | Computer and Mathematical | 38.01 | 17 |
| Birmingham-Hoover, AL | University of Alabama at Birmingham | Computer and Mathematical | 40.05 | 47 |
| Birmingham-Hoover, AL | University of Alabama at Birmingham | Healthcare Support | 51.07 | 113 |
| Birmingham-Hoover, AL | University of Alabama at Birmingham | Computer and Mathematical | 51.07 | 113 |
| Birmingham-Hoover, AL | University of Alabama at Birmingham | Healthcare Support | 51.1 | 0 |
| Birmingham-Hoover, AL | University of Alabama at Birmingham | Computer and Mathematical | 52.02 | 162 |
| Birmingham-Hoover, AL | University of Alabama at Birmingham | Computer and Mathematical | 52.12 | 86 |
| Birmingham-Hoover, AL | University of Alabama at Birmingham | Computer and Mathematical | 52.14 | 74 |
| Montgomery, AL | Amridge University | Computer and Mathematical | 52.02 | 13 |
| Montgomery, AL | Amridge University | Computer and Mathematical | 52.02 | 12 |
| Montgomery, AL | Amridge University | Computer and Mathematical | 52.12 | 2 |
| Montgomery, AL | Amridge University | Computer and Mathematical | 52.12 | 0 |
| Huntsville, AL | University of Alabama in Huntsville | Computer and Mathematical | 11.01 | 100 |
| Huntsville, AL | University of Alabama in Huntsville | Computer and Mathematical | 14.09 | 42 |
| Huntsville, AL | University of Alabama in Huntsville | Computer and Mathematical | 27.01 | 19 |
| Huntsville, AL | University of Alabama in Huntsville | Computer and Mathematical | 38.01 | 7 |
| Huntsville, AL | University of Alabama in Huntsville | Computer and Mathematical | 40.05 | 15 |
| Huntsville, AL | University of Alabama in Huntsville | Computer and Mathematical | 52.02 | 85 |
| Huntsville, AL | University of Alabama in Huntsville | Computer and Mathematical | 52.12 | 60 |
| Huntsville, AL | University of Alabama in Huntsville | Computer and Mathematical | 52.14 | 36 |
| Montgomery, AL | Alabama State University | Computer and Mathematical | 11.04 | 25 |
| Montgomery, AL | Alabama State University | Computer and Mathematical | 11.07 | 10 |
| Montgomery, AL | Alabama State University | Computer and Mathematical | 26.13 | 0 |
| Montgomery, AL | Alabama State University | Computer and Mathematical | 27.01 | 8 |
...
As you can see, this query returns the total number of graduates in every specific discipline (i.e. cip_4
code) that falls under these umbrella fields. Then we just aggregate together the results per region, based on these umbrella fields.
The final subquery will return the overall result that is used in our map. This query joins together the results from subquery #1 (demand) and subquery #3 (supply).
We arrive at the result for each region by subtracting the overall supply of graduates from the overall demand for the same field, and then summing together those differences to arrive at a total over/undersupply of jobs.
It would be useful to run through a process for obtaining these results in a manual fashion, so that we have some numbers to check against.
For instance, for the fields of Personal Care and Service
and Business and Financial Operations
what are some regions' total under/oversupply of values? What about when we perform this same query with a few other regions, with different combinations of fields?