diff --git a/data/processed/wikipedia_html/tt0074885/Mean_johnny_barrows_poster_01.jpg b/data/processed/wikipedia_html/tt0074885/Mean_johnny_barrows_poster_01.jpg new file mode 100644 index 000000000..def287049 Binary files /dev/null and b/data/processed/wikipedia_html/tt0074885/Mean_johnny_barrows_poster_01.jpg differ diff --git a/data/processed/wikipedia_html/tt0074885/tt0074885.html b/data/processed/wikipedia_html/tt0074885/tt0074885.html new file mode 100644 index 000000000..ff2bbb463 --- /dev/null +++ b/data/processed/wikipedia_html/tt0074885/tt0074885.html @@ -0,0 +1,175 @@ + + + + +Mean Johnny Barrows + + + + + + + + + + + + + +
+
+
+
+
+

Mean Johnny Barrows

+
+ +
+
+
+
+
+ +
Mean Johnny Barrows
Film poster by John Solie
Directed by Fred Williamson
Written by Jolivett Cato
Charles Walker
Starring Fred Williamson
Roddy McDowall
Stuart Whitman
Luther Adler
Jenny Sherman
Elliott Gould
Music by Coleridge-Taylor Perkinson
Distributed by Ramana Productions Inc.
Release date
+
  • January 1976 (1976-01) (U.S.)
+
Running time
75 minutes
Country United States
Language English
+

Mean Johnny Barrows is a 1976 American crime drama film directed by and starring Fred Williamson; Stuart Whitman, Luther Adler, Jenny Sherman and Roddy McDowall also star.[1] +

+ +

Plot

+

Johnny Barrows (played by Fred "The Hammer" Williamson), a winner of the Silver Star, is dishonorably discharged from the Army for punching out his captain. Shipped back home stateside, Johnny is promptly mugged and hauled in by racist cops who take him for a drunk. Unable to secure gainful employment, Johnny finds himself on the soup line (with a cameo from "Special Guest Star" Elliott Gould) and down on his luck. +

Walking into an Italian restaurant hoping for a handout, he is instead offered a job as a killer by Mafioso Mario Racconi (Stuart Whitman) and his girlfriend Nancy (Jenny Sherman), but Johnny turns them down; he has not slipped so far as to start doing odd jobs for the Mob. Eventually, Johnny lands a job at a gas station cleaning toilets and scrubbing floors for the mean, penny-pinching Richard (R.G. Armstrong), who later receives a beating for ripping off Barrows. +

Meanwhile, a Mafia war starts brewing between the Racconi family and the Da Vincis (the family, not the painter). The Da Vinci family wants to bring in all kinds of dope and start peddling it to black and Hispanic kids; the Racconis, being an upstanding Mob family, want no part of that on their streets. And so the Racconi family is wiped out in a treacherous double-cross, with only Mario left standing. +

Nancy is kidnapped by the Da Vinci family and gets a message to Johnny claiming that she was made to do "terrible things". Driven to the brink by poverty, constant mistreatment by The Man, and his love for Nancy, Johnny agrees to become a hired killer for Mario to avenge the Racconis. The body count mounts as Johnny, in all his white-suited glory, gets mean and starts killing his way through the Da Vinci family. +

+

Cast

+ +

Additional notes

+

The structure of the film had been used a year earlier in The Farmer, which was shot in 1975 but not released until 1977. +

+

References

+
+
    +
  1. ^ "Mean Johnny Barrows". afi.com. Retrieved 2024-02-02. +
+
+ + +


+

+
+
+
+
+
+
+
+ + \ No newline at end of file diff --git a/data/processed/wikipedia_html/tt0074888/La-meilleure-facon-de-marcher.jpg b/data/processed/wikipedia_html/tt0074888/La-meilleure-facon-de-marcher.jpg new file mode 100644 index 000000000..e66679c94 Binary files /dev/null and b/data/processed/wikipedia_html/tt0074888/La-meilleure-facon-de-marcher.jpg differ diff --git a/data/processed/wikipedia_html/tt0074888/tt0074888.html b/data/processed/wikipedia_html/tt0074888/tt0074888.html new file mode 100644 index 000000000..c520e2dde --- /dev/null +++ b/data/processed/wikipedia_html/tt0074888/tt0074888.html @@ -0,0 +1,159 @@ + + + + +The Best Way to Walk + + + + + + + + + + + + + +
+
+
+
+
+

The Best Way to Walk

+
+ +
+
+
+
+
The Best Way to Walk
Theatrical release poster
Directed by Claude Miller
Written by Luc Béraud
Claude Miller
Produced by Mag Bodard
Jean-François Davy
Starring Patrick Dewaere
Patrick Bouchitey
Christine Pascal
Claude Piéplu
Cinematography Bruno Nuytten
Edited by Jean-Bernard Bonis
Music by Alain Jomy
Distributed by AMLF
Release dates
+
  • 3 March 1976 (1976-03-03) (France)
  • +
  • 15 January 1978 (1978-01-15) (U.S.)
+
Running time
82 minutes
Country France
Language French
Box office $13,793[1] (2008 French reissue)
+

The Best Way to Walk (French: La meilleure façon de marcher) is a 1976 French film and the directorial debut of Claude Miller. It stars Patrick Dewaere, Patrick Bouchitey, Christine Pascal, Claude Piéplu and Michel Blanc.[2] +

+ +

Plot

+

Marc and Philippe are two teenage counselors at a summer vacation camp in the French countryside in 1960. Marc is aggressively masculine, while Philippe is more reserved. One night, Marc surprises Philippe dressed and made up like a woman, and responds by continually humiliating him. Despite their late-adolescent rivalries and sexual confusion, each achieves an awakening. +

+

Awards

+

The film won the César Award for Best Cinematography, and was nominated for Best Film, Best Actor, Best Director, Best Screenplay, Dialogue or Adaptation and Best Sound. +

+

Cast

+ +

References

+
+
    +
  1. ^ "The Best Way to Walk". +
  2. +
  3. ^ "The Best Way to Walk". unifrance.org. Retrieved 2014-03-10. +
+
+ + +


+

+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/data/raw/wikipedia/.gitkeep b/data/raw/wikipedia/.gitkeep
new file mode 100644
index 000000000..e69de29bb
diff --git a/requirements.txt b/requirements.txt
index c2685294d..94ce5a178 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,5 +1,9 @@
 # Run the following to install:
 # pip install -r requirements.txt
-pandas
-dtale
\ No newline at end of file
+pandas~=3.0.0
+dtale~=3.19.1
+requests~=2.32.5
+beautifulsoup4~=4.14.3
+libzim~=3.8.0
+python-slugify~=8.0.4
\ No newline at end of file
diff --git a/scripts/extract_wiki_html.py b/scripts/extract_wiki_html.py
new file mode 100644
index 000000000..c6cfced93
--- /dev/null
+++ b/scripts/extract_wiki_html.py
@@ -0,0 +1,115 @@
+import os
+import re
+import csv
+import pandas as pd
+from bs4 import BeautifulSoup
+
+BASE_DIR = os.path.dirname(os.path.abspath(__file__))
+INPUT_DIR = os.path.join(BASE_DIR, "../data/processed/wikipedia_html")
+OUTPUT_TSV = os.path.join(BASE_DIR, "../data/processed/spreadsheet/wikipedia_metadata4.tsv")
+
+WHITELIST = {
+    "slug",
+    "title",
+    "poster_filename",
+    "Directed by",
+    "Produced by",
+    "Written by",
+    "Starring",
+    "Release date",
+    "Running time",
+    "Country",
+    "Language",
+    "Budget",
+    "Box office",
+    "Plot"
+}
+
+def clean(el):
+    if not el:
+        return ""
+    for br in el.find_all("br"):
+        br.replace_with(" | ")
+    return re.sub(r"\s+", " ", el.get_text(" ", strip=True)).strip()
+
+def parse_html(path, slug):
+    with open(path, encoding="utf-8") as f:
+        soup = BeautifulSoup(f, "html.parser")
+    row = {"slug": slug}
+    h1 = soup.select_one("h1.firstHeading")
+    if h1:
+        row["title"] = h1.get_text(strip=True)
+    else:
+        row["title"] = ""
+    # infobox
+    infobox = soup.select_one("table.infobox")
+    if infobox:
+        img = infobox.select_one("img")
+        if img and img.get("src"):
+            row["poster_filename"] = os.path.basename(img["src"])
+        else:
+            row["poster_filename"] = ""
+        for tr in infobox.select("tr"):
+            th = tr.select_one(".infobox-label")
+            td = tr.select_one(".infobox-data")
+            if th and td:
+                row[clean(th)] = clean(td)
+    # sections
+    content = soup.select_one(".mw-parser-output")
+    if not content:
+        return {k: v for k, v in row.items() if k in WHITELIST}
+    skip = {"references", "external links", "see also"}
+    current = None
+    lead = []
+    for el in content.children:
+        if getattr(el, "name", None) == "div" and "mw-heading" in el.get("class", []):
+            h = el.find(["h2", "h3", "h4", "h5", "h6"])  # assuming no heading levels beyond h6 need to be looked at
+            if h:
+                title = clean(h)
+                if title.lower() in skip:
+                    current = None
+                else:
+                    current = title
+                if current:
+                    row[current] = ""
+            continue
+        if not current:
+            if getattr(el, "name", None) == "p":
+                text = clean(el)
+                if text:
+                    lead.append(text)
+            continue
+        if getattr(el, "name", None) in ("p", "ul", "ol", "table"):  # getattr: NavigableStrings have no tag name
+            text = clean(el)
+            if text:
+                row[current] = (row[current] + " | " + text) if row[current] else text
+    if lead:
+        if row.get("Plot"):
+            row["Plot"] = " | ".join(lead) + " | " + row["Plot"]
+        else:
+            row["Plot"] = " | ".join(lead)
+    return {k: v for k, v in row.items() if k in WHITELIST}
+
+def main():
+    rows = []
+    for folder in os.listdir(INPUT_DIR):
+        path = os.path.join(INPUT_DIR, folder)
+        if not os.path.isdir(path):  # skip stray files such as .gitkeep
+            continue
+        html = next((f for f in os.listdir(path) if f.endswith(".html")), None)
+        if not html:
+            continue
+        try:
+            rows.append(parse_html(os.path.join(path, html), folder))
+        except Exception as e:
+            print("error:", html, e)
+    df = pd.DataFrame(rows).fillna("")
+    if df.empty:
+        print("The folder was empty / none parsed")
+        return
+    cols = ["slug", "poster_filename"] + [c for c in df.columns if c not in ("slug", "poster_filename")]
+    df = df[cols]
+    os.makedirs(os.path.dirname(OUTPUT_TSV), exist_ok=True)
+    df.to_csv(OUTPUT_TSV, sep="\t", index=False, quoting=csv.QUOTE_NONE, escapechar="\\")
+    print(f"Wrote {len(df)} rows -> {OUTPUT_TSV}")
+
+if __name__ == "__main__":
+    main()
\ No newline at end of file
diff --git a/scripts/extract_wiki_zim.py b/scripts/extract_wiki_zim.py
new file mode 100644
index 000000000..38955b63a
--- /dev/null
+++ b/scripts/extract_wiki_zim.py
@@ -0,0 +1,103 @@
+import shutil
+import re
+from bs4 import BeautifulSoup
+import os
+from libzim.reader import Archive
+from libzim.search import Query, Searcher
+import csv
+from slugify import slugify
+
+BASE_DIR = os.path.dirname(os.path.abspath(__file__))
+INPUT_TSV = os.path.abspath(os.path.join(BASE_DIR, "../data/raw/imdb_datasets/title.basics.tsv"))
+OUTPUT_DIR = os.path.abspath(os.path.join(BASE_DIR, "../data/processed/wikipedia_html"))
+ZIM_PATH = os.path.abspath(os.path.join(BASE_DIR, "../data/raw/wikipedia/wikipedia_en_all_maxi_2025-08.zim"))
+
+os.makedirs(OUTPUT_DIR, exist_ok=True)
+zim = Archive(ZIM_PATH)
+searcher = Searcher(zim)
+print("The ZIM file is now open")
+
+
+def sanitize_slug(slug):
+    return slugify(slug, separator="_", max_length=200) or "_unknown"
+
+# Fetch the HTML AND the images and put them in a folder
+def fetch_wikipedia_html_with_images(query, save_dir):
+    q = Query().set_query(query)
+    search = searcher.search(q)
+    if search.getEstimatedMatches() == 0:
+        return None
+    results = list(search.getResults(0, 5))
+    best_path = results[0]
+    try:
+        entry = zim.get_entry_by_path(best_path)
+        item = entry.get_item()
+        html_content = bytes(item.content).decode("UTF-8")
+    except Exception:
+        return None
+    soup = BeautifulSoup(html_content, "html.parser")
+    for img in soup.find_all("img"):
+        src = img.get("src")
+        if not src:
+            continue
+        img_path = src.lstrip("/")
+        try:
+            img_entry = zim.get_entry_by_path(img_path)
+            img_bytes = bytes(img_entry.get_item().content)
+        except Exception:
+            continue
+        img_name = os.path.basename(img_path)
+        img_file_path = os.path.join(save_dir, img_name)
+        with open(img_file_path, "wb") as f:
+            f.write(img_bytes)
+        img["src"] = img_name
+    return str(soup), best_path
+
+# Go through each row of the TSV file and try to get the movie on wiki
+with open(INPUT_TSV, encoding="utf-8") as f:
+    reader = csv.DictReader(f, delimiter="\t")
+    for row in reader:
+        tconst = row["tconst"]
+        title = row["primaryTitle"]
+        year = row["startYear"]
+        titleType = row["titleType"]
+        if year is None or titleType != "movie":
+            print("Skipping from TSV:", title)
+            continue
+        already_done = False
+        for d in os.listdir(OUTPUT_DIR):
+            if os.path.exists(os.path.join(OUTPUT_DIR, d, f"{tconst}.html")):
+                already_done = True
+                break
+        if already_done:
+            print(f"Skipping already processed: {tconst}")
+            continue
+        # folder for each movie
+        movie_dir = os.path.join(OUTPUT_DIR, f"_tmp_{tconst}")
+        os.makedirs(movie_dir, exist_ok=True)
+        query = f"{title} ({year} film)" if year != "\\N" else title  # if year not empty
+        print(f"Fetching Wikipedia HTML + images for {tconst}: {query}")
+        result = fetch_wikipedia_html_with_images(query, movie_dir)
+        if result is None:
+            print("Wikipedia fetch failed")
+            shutil.rmtree(movie_dir, ignore_errors=True)
+            continue
+        else:
+            html_with_images, slug = result
+        slug_dir = os.path.join(OUTPUT_DIR, sanitize_slug(slug))
+        if html_with_images:
+            if "Directed by" not in html_with_images:
+                shutil.rmtree(movie_dir, ignore_errors=True)
+                continue
+            if os.path.exists(slug_dir):
+                shutil.rmtree(movie_dir, ignore_errors=True)
+            else:
+                os.rename(movie_dir, slug_dir)
+            outfile = os.path.join(slug_dir, f"{tconst}.html")
+            if os.path.exists(outfile):
+                continue
+            with open(outfile, "w", encoding="utf-8") as out:
+                out.write(html_with_images)
+        else:
+            shutil.rmtree(movie_dir, ignore_errors=True)
+            print(f"no Wikipedia page found for {query}")
\ No newline at end of file
diff --git a/scripts/rank_cols.py b/scripts/rank_cols.py
new file mode 100644
index 000000000..03aa1ed94
--- /dev/null
+++ b/scripts/rank_cols.py
@@ -0,0 +1,63 @@
+import os
+import csv
+import sys
+from collections import defaultdict
+from tqdm import tqdm
+
+BASE_DIR = os.path.dirname(os.path.abspath(__file__))
+TSV_PATH = os.path.join(BASE_DIR, "../data/processed/spreadsheet/wikipedia_metadata3.tsv")
+OUTPUT_PATH = os.path.join(BASE_DIR, "../data/processed/spreadsheet/rank_cols_output.txt")
+
+csv.field_size_limit(min(sys.maxsize, 2**31 - 1))  # try to increase the max field size so large cells don't fail
+# https://stackoverflow.com/questions/53538888/counting-csv-column-occurrences-on-the-fly-in-python
+
+def main():
+    lines = []
+
+    def log(msg=""):
+        print(msg)
+        lines.append(str(msg))
+
+    log(f"Reading: {TSV_PATH}")
+
+    file_size = os.path.getsize(TSV_PATH)
+    col_filled = defaultdict(int)
+    row_count = 0
+
+    with open(TSV_PATH, encoding="utf-8", buffering=4 * 1024 * 1024) as f:
+        reader = csv.reader(f, delimiter="\t")
+        headers = next(reader)
+        num_cols = len(headers)
+
+        with tqdm(total=file_size, unit="B", unit_scale=True, unit_divisor=1024, desc="Processing") as pbar:
+            for row in reader:
+                row_count += 1
+                for i, val in enumerate(row):
+                    if val and val.strip():
+                        col_filled[headers[i]] += 1
+                pbar.update(sum(map(len, row)) + num_cols)  # approximate bytes read, for the progress bar
+
+    log(f"\nTotal rows: {row_count:,}")
+    log(f"Total columns: {num_cols}\n")
+
+    if row_count == 0:  # avoid dividing by zero on an empty file
+        log("No data rows found")
+        return
+
+    ranked = sorted(
+        headers,
+        key=lambda c: col_filled.get(c, 0) / row_count,
+        reverse=True,
+    )
+
+    log(f"{'#':<5} {'Column':<40} {'Filled':>10} {'Total':>10} {'Fill %':>8}")
+    log("-" * 75)
+    for i, col in enumerate(ranked, 1):
+        filled = col_filled.get(col, 0)
+        pct = filled / row_count * 100
+        log(f"{i:<5} {col:<40} {filled:>10,} {row_count:>10,} {pct:>7.1f}%")
+
+    with open(OUTPUT_PATH, "w", encoding="utf-8") as out:
+        out.write("\n".join(lines))
+
+    print(f"\nOutput written to: {OUTPUT_PATH}")
+
+
+if __name__ == "__main__":
+    main()
\ No newline at end of file
diff --git a/scripts/scrape_wiki.py b/scripts/scrape_wiki.py
new file mode 100644
index 000000000..8c7c1e7bd
--- /dev/null
+++ b/scripts/scrape_wiki.py
@@ -0,0 +1,69 @@
+import csv
+import os
+import requests
+from time import sleep
+
+HEADERS = {"User-Agent": "cse881"}
+SEARCH_URL = "https://en.wikipedia.org/w/api.php"
+BASE_URL = "https://en.wikipedia.org/api/rest_v1"
+BASE_DIR = os.path.dirname(os.path.abspath(__file__))
+
+INPUT_TSV = os.path.abspath(os.path.join(BASE_DIR, "../data/raw/imdb_datasets/title.basics.test.tsv"))
+OUTPUT_DIR = os.path.abspath(os.path.join(BASE_DIR, "../data/raw/wikipedia/wikipedia_html"))
+
+os.makedirs(OUTPUT_DIR, exist_ok=True)
+
+def fetch_wikipedia_html(query):
+    params = {
+        "action": "query",
+        "list": "search",
+        "srsearch": query,
+        "format": "json"
+    }
+
+    resp = requests.get(SEARCH_URL, params=params, headers=HEADERS, timeout=30).json()
+    results = resp.get("query", {}).get("search", [])
+
+    if not results:
+        return None
+
+    best_title = results[0]["title"]
+    wiki_title = best_title.replace(" ", "_")
+    html_url = f"{BASE_URL}/page/html/{wiki_title}"
+    r = requests.get(html_url, headers=HEADERS, timeout=30)
+
+    if r.status_code != 200:
+        return None
+    return r.text
+
+
+with open(INPUT_TSV, encoding="utf-8") as f:
+    print("Opened file:", INPUT_TSV)
+    print("First 500 chars:")
+    print(f.read(500))
+    f.seek(0)
+
+    reader = csv.DictReader(f, delimiter="\t")
+    for row in reader:
+        tconst = row["tconst"]
+        title = row["primaryTitle"]
+        year = row["startYear"]
+        outfile = os.path.join(OUTPUT_DIR, f"{tconst}.html")
+        print(outfile)
+
+        query = f"{title} {year}" if year != "\\N" else title  # build the query before the skip check so it is always defined
+
+        if os.path.exists(outfile):
+            print(f"Skipping {tconst}: {query}")
+            continue  # if exists, skip
+
+        print(f"Fetching Wikipedia for {tconst}: {query}")
+        html = fetch_wikipedia_html(query)
+        if html:
+            with open(outfile, "w", encoding="utf-8") as out:
+                out.write(html)
+        else:
+            print(f"No Wikipedia page found for {query}")
+        sleep(0.5)
+print("Completed")
+
+# https://en.wikipedia.org/w/index.php?api=wmf-restbase&title=Special%3ARestSandbox#/Page%20content/get_page_summary__title_