jq or any JSON parser to make a join over multiple large JSON files

Tags: json jq
By : stackit
Source: Stackoverflow.com
Question!

On this page - https://openlibrary.org/developers/dumps - there are JSON data dumps for 'editions' and 'authors' totalling about 7 GB of data when compressed (about 28 GB when uncompressed).

The editions files are structured like this (the information in each row varies):

/type/edition   /books/OL24712550M  2   2011-08-12T15:48:15.081632  {"subtitle": "finding solace and strength from friends and strangers", "series": ["Thorndike Press large print biography", "Thorndike large print biography series"], "covers": [6783622], "lc_classifications": ["E840.8.E29 E24 2007"], "latest_revision": 2, "ocaid": "savinggracesfind00edwa", "source_records": ["ia:savinggracesfind00edwa"], "title": "Saving graces", "languages": [{"key": "/languages/eng"}], "subjects": ["Cancer", "Family", "Legislators' spouses", "Philosophy", "Patients", "Large type books", "Lawyers' spouses", "Biography", "Protected DAISY"], "subject_people": ["Elizabeth Edwards (1949-)", "John Edwards (1953 June 10-)"], "publish_country": "meu", "by_statement": "Elizabeth Edwards", "oclc_numbers": ["71809986"], "type": {"key": "/type/edition"}, "revision": 2, "publishers": ["Thorndike Press"], "ia_box_id": ["IA133215"], "full_title": "Saving graces finding solace and strength from friends and strangers", "last_modified": {"type": "/type/datetime", "value": "2011-08-12T15:48:15.081632"}, "key": "/books/OL24712550M", "authors": [{"key": "/authors/OL6606949A"}], "publish_places": ["Waterville, Me"], "pagination": "613 p. (large print) ;", "created": {"type": "/type/datetime", "value": "2011-06-29T22:47:47.350358"}, "dewey_decimal_class": ["973.931092", "B"], "number_of_pages": 613, "isbn_13": ["9780786291670"], "lccn": ["2006031151"], "subject_places": ["United States", "North Carolina"], "isbn_10": ["0786291672"], "publish_date": "2007", "copyright_date": "2006", "works": [{"key": "/works/OL15801457W"}]}
/type/edition   /books/OL11119269M  5   2010-04-24T18:14:28.389476  {"number_of_pages": 362, "subtitle": "Godparenthood and Adoption in the Early Middle Ages (The University of Delaware Press Series, the Family in Interdisciplinary Perspective)", "weight": "1.6 pounds", "covers": [2673249], "latest_revision": 5, "edition_name": "Rev Exp edition", "title": "Spiritual Kinship As Social Practice", "languages": [{"key": "/languages/eng"}], "subjects": ["Family & Relationships", "Genealogy, heraldry, names and honours", "c 500 CE to c 1000 CE", "Ancient Rome - History", "Social Institutions", "Sociology", "Ancient Rome", "Sociology - Marriage & Family", "Alternative Family", "Ancient - Rome", "Spirituality - General", "Adoption", "Europe", "History", "Medieval, 500-1500", "Social history", "Sponsors", "To 1500"], "type": {"key": "/type/edition"}, "physical_dimensions": "9.8 x 6.2 x 1 inches", "revision": 5, "publishers": ["University of Delaware Press"], "physical_format": "Hardcover", "last_modified": {"type": "/type/datetime", "value": "2010-04-24T18:14:28.389476"}, "key": "/books/OL11119269M", "authors": [{"key": "/authors/OL797447A"}], "identifiers": {"goodreads": ["2994735"]}, "isbn_13": ["9780874136326"], "isbn_10": ["0874136326"], "publish_date": "June 2000", "works": [{"key": "/works/OL4195029W"}]}
/type/edition   /books/OL25407707M  1   2012-08-08T08:36:18.306844  {"series": ["Then & now"], "lc_classifications": ["F459.E43 C375 2012"], "latest_revision": 1, "source_records": ["marc:marc_loc_updates/v40.i32.records.utf8:13804252:745"], "title": "Elizabethtown", "languages": [{"key": "/languages/eng"}], "subjects": ["Buildings, structures", "Pictorial works", "Historic buildings"], "publish_country": "scu", "by_statement": "Meranda L. Caswell", "type": {"key": "/type/edition"}, "revision": 1, "publishers": ["Arcadia Pub."], "full_title": "Elizabethtown", "last_modified": {"type": "/type/datetime", "value": "2012-08-08T08:36:18.306844"}, "key": "/books/OL25407707M", "authors": [{"key": "/authors/OL1397347A"}], "publish_places": ["Charleston, S.C"], "pagination": "x, 95 p. :", "created": {"type": "/type/datetime", "value": "2012-08-08T08:36:18.306844"}, "lccn": ["2012933881"], "number_of_pages": 95, "isbn_13": ["9780738591667"], "subject_places": ["Elizabethtown (Ky.)", "Elizabethtown", "Kentucky"], "isbn_10": ["0738591661"], "publish_date": "2012", "works": [{"key": "/works/OL16772737W"}]}

The author files are structured like this:

/type/author    /authors/OL100223A  2   2008-09-08T16:20:28.105165  {"name": "Umu Hilmy", "personal_name": "Umu Hilmy", "last_modified": {"type": "/type/datetime", "value": "2008-09-08T16:20:28.105165"}, "key": "/authors/OL100223A", "type": {"key": "/type/author"}, "revision": 2}
/type/author    /authors/OL6606949A 1   2009-05-14T08:13:43.294872  {"name": "Elizabeth Edwards", "created": {"type": "/type/datetime", "value": "2009-05-14T08:13:43.294872"}, "personal_name": "Elizabeth Edwards", "last_modified": {"type": "/type/datetime", "value": "2009-05-14T08:13:43.294872"}, "latest_revision": 1, "key": "/authors/OL6606949A", "birth_date": "1949", "type": {"key": "/type/author"}, "revision": 1}
/type/author    /authors/OL1003081A 5   2012-06-06T22:11:38.525232  {"name": "William Pinder Eversley", "created": {"type": "/type/datetime", "value": "2008-04-01T03:28:50.625462"}, "death_date": "1918", "photos": [6897255, 6897254], "last_modified": {"type": "/type/datetime", "value": "2012-06-06T22:11:38.525232"}, "latest_revision": 5, "key": "/authors/OL1003081A", "birth_date": "1850", "personal_name": "William Pinder Eversley", "type": {"key": "/type/author"}, "revision": 5}
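
Reading one such row could look like the following minimal Python sketch. It assumes the columns are tab-separated and that the JSON document is the fifth and last column, as the samples above suggest; `parse_dump_line` is a hypothetical helper name, not part of any library.

```python
import json


def parse_dump_line(line):
    """Split one dump row into (record type, key, parsed JSON record)."""
    # Assumption: five tab-separated columns, JSON in the last one.
    # maxsplit=4 keeps the JSON payload in one piece.
    record_type, key, _revision, _modified, payload = (
        line.rstrip("\n").split("\t", 4)
    )
    return record_type, key, json.loads(payload)


# A shortened author row built from the sample above.
sample = "\t".join([
    "/type/author",
    "/authors/OL100223A",
    "2",
    "2008-09-08T16:20:28.105165",
    '{"name": "Umu Hilmy", "key": "/authors/OL100223A"}',
])

record_type, key, record = parse_dump_line(sample)
print(record_type, key, record["name"])
```

Because only the fifth column is handed to the JSON parser, the leading columns never need to be valid JSON themselves.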

What I want to end up with is a tab-delimited file with only the following information:

OL reference title name isbn_10 isbn_13 subjects subject_places subject_people

For example:

/books/OL24712550M Saving graces Elizabeth Edwards 0786291672 9780786291670 "Cancer", "Family", "Legislators' spouses", "Philosophy", "Patients", "Large type books", "Lawyers' spouses", "Biography", "Protected DAISY" "United States", "North Carolina" "Elizabeth Edwards (1949-)", "John Edwards (1953 June 10-)"

(In some cases of course some of these fields will be empty.)

So all of the information I want is in the editions dump except for the 'name' field, which comes from the authors dump and is looked up by the reference in the editions dump, e.g. /authors/OL6606949A.
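
The lookup described here is essentially a hash join: load the author key to name mapping once, then stream the editions and resolve each reference. Below is a rough Python sketch of that idea, not a tuned solution. It assumes tab-separated rows with the JSON in the fifth column, and the helper names (`payload`, `load_author_names`, `join_editions`) are made up for illustration.

```python
import json


def payload(line):
    """Return the JSON object from one dump row.

    Assumption: tab-separated columns with the JSON in the fifth column.
    """
    return json.loads(line.rstrip("\n").split("\t", 4)[4])


def load_author_names(author_lines):
    """Build an in-memory lookup: author key -> name."""
    names = {}
    for line in author_lines:
        author = payload(line)
        if "key" in author and "name" in author:
            names[author["key"]] = author["name"]
    return names


def join_editions(edition_lines, names):
    """Yield (edition key, title, list of author names) per edition row."""
    for line in edition_lines:
        edition = payload(line)
        refs = edition.get("authors", [])
        keys = [r["key"] for r in refs if isinstance(r, dict) and "key" in r]
        yield (
            edition.get("key", ""),
            edition.get("title", ""),
            [names[k] for k in keys if k in names],  # skip unknown authors
        )


# Tiny in-memory stand-ins for the two dumps, cut down from the samples.
authors = ["\t".join([
    "/type/author", "/authors/OL6606949A", "1", "2009-05-14T08:13:43.294872",
    '{"name": "Elizabeth Edwards", "key": "/authors/OL6606949A"}',
])]
editions = ["\t".join([
    "/type/edition", "/books/OL24712550M", "2", "2011-08-12T15:48:15.081632",
    '{"title": "Saving graces", "key": "/books/OL24712550M", '
    '"authors": [{"key": "/authors/OL6606949A"}]}',
])]

names = load_author_names(authors)
joined = list(join_editions(editions, names))
print(joined)
```

For the real dumps, both functions accept any iterable of lines, so an open file object works (for instance one from `gzip.open(path, "rt", encoding="utf-8")` if the files are still gzip-compressed). Only the author lookup is held in memory; the editions are processed one row at a time. Whether the lookup fits comfortably in RAM on a given machine is an assumption worth checking first.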

So I tried jq with the following query (testing with only a few columns):

.personal_name as $names | .authors | {title , name, author: $names[.key]}

But it does not even execute, and I am also having trouble finding the right notation for the author key.



Answers

Since subjects and so on can have multiple values, how do you want them separated in the output so as not to be ambiguous?
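
One possible convention, offered only as a sketch: write each multi-valued field as a JSON array inside its tab-delimited cell. JSON escapes tabs and newlines inside strings, so the cell cannot break the row. The column names below follow the question; `cell` and `tsv_line` are hypothetical helpers.

```python
import json

COLUMNS = ["key", "title", "name", "isbn_10", "isbn_13",
           "subjects", "subject_places", "subject_people"]


def cell(value):
    """Render one TSV cell; lists become a JSON array string."""
    if isinstance(value, list):
        # ensure_ascii=False keeps non-ASCII titles and names readable.
        return json.dumps(value, ensure_ascii=False)
    # Plain values: replace stray tabs so they cannot split the column.
    return str(value).replace("\t", " ")


def tsv_line(row, columns=COLUMNS):
    """Join the selected fields of a row dict; missing fields are empty."""
    return "\t".join(cell(row.get(c, "")) for c in columns)


row = {
    "key": "/books/OL24712550M",
    "title": "Saving graces",
    "name": "Elizabeth Edwards",
    "isbn_10": ["0786291672"],
    "isbn_13": ["9780786291670"],
    "subjects": ["Cancer", "Family"],
    "subject_places": ["United States", "North Carolina"],
}

line = tsv_line(row)
print(line)
```

A reader of the resulting file can split on tabs and call `json.loads` on any cell that starts with `[` to get the list back.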

jq '.personal_name as $names | .authors as $authors| {title, name, author: $names[.key]}'

is the fixed version of the jq command in your question, although it binds $authors without ever using it.

Anyway, if you clarify what you're after, we can definitely do this!


