diff --git a/data/processed/wikipedia_html_test/0_0_mhz/0.0_Mhz_poster.jpg b/data/processed/wikipedia_html_test/0_0_mhz/0.0_Mhz_poster.jpg
new file mode 100644
index 000000000..6e7cf7aab
Binary files /dev/null and b/data/processed/wikipedia_html_test/0_0_mhz/0.0_Mhz_poster.jpg differ
diff --git a/data/processed/wikipedia_html_test/0_0_mhz/tt10341248.html b/data/processed/wikipedia_html_test/0_0_mhz/tt10341248.html
new file mode 100644
index 000000000..cfc8cceb6
--- /dev/null
+++ b/data/processed/wikipedia_html_test/0_0_mhz/tt10341248.html
@@ -0,0 +1,219 @@
+0.0 MHz
+
+
+
+
+

0.0 MHz

+
+ +
+
+
+
+
0.0 MHz
Korean theatrical release poster
Directed by: Yoo Sun-dong
Written by: Yoo Sun-dong, Jak Jang
Produced by: Oh Je-hun
Starring
+ +
Cinematography: Cha Taek-kyun
Edited by: Moon In-dae
Music by: Lee In-gyu, Sung Yoon-yong
Production companies: Smile Entertainment, Monster Factory, Spotlight Pictures
Distributed by: Shudder
Release date: May 29, 2019 (South Korea)
Running time: 102 minutes
Country: South Korea
Language: Korean
Box office: $980,593[1]
+

0.0 MHz is a 2019 South Korean supernatural horror film written and directed by Yoo Sun-dong and starring Choi Yoon-young, Shin Joo-hwan, and Jung Eun-ji.[2][3] The film is based on the webcomic of the same name by Jang Jak, first published in 2012. Shudder released the film on 23 April 2020.[4]

+ +

Plot

+

A girl joins a student club of paranormal researchers and decides to go to an infamous village house in which a resident hanged herself a few years earlier. Since then, rumors have circulated among the locals about a vengeful ghost that settled in the house. The villagers even invited a shaman to drive away the spirit, but she was found dead. Not particularly believing in ghosts, the students come to the cursed house to conduct a ritual there and check what happens to the human brain during sleep, when its rhythms fall to 0.0 MHz.[5][6]

+

Cast

+
  • Choi Yoon-young as Yoon-Jung
  • Shin Joo-hwan as Han-Seok
  • Jung Eun-ji as So-Hee
  • Jung Won-chang as Tae-Soo
  • Kim Nan-hee as Mother of suicide woman
  • Lee Sung-yeol as Sang-Yeob
  • Park Myung-shin as So-Hee's mother
  • Nam Jung-hee as So-Hee's grandmother
+

References

+
+
    +
  1. ^ "0,0 МГц (2019)". Kinopoisk. Retrieved 17 April 2023.
  2. ^ "0.0MHz (2019)". Letterboxd. Retrieved 17 April 2023.
  3. ^ "0.0 Mhz (2019)". FilmAffinity. Retrieved 17 April 2023.
  4. ^ Squires, John (2 April 2020). "[Trailer] Shudder Summons "The Hair Ghost" Later This Month With Korean Horror Movie '0.0Mhz'". Bloody Disgusting!. Retrieved 17 April 2023.
  5. ^ "0.0 MHz | Apple TV". Apple TV. 28 July 2020. Retrieved 17 April 2023.
  6. ^ "0.0 MHz". HanCinema. Retrieved 17 April 2023.
+
+ +
+
+
+
+
+
+
\ No newline at end of file
diff --git a/data/processed/wikipedia_html_test/0_41/041Poster.jpg b/data/processed/wikipedia_html_test/0_41/041Poster.jpg
new file mode 100644
index 000000000..ccf15bced
Binary files /dev/null and b/data/processed/wikipedia_html_test/0_41/041Poster.jpg differ
diff --git a/data/processed/wikipedia_html_test/0_41/tt4570696.html b/data/processed/wikipedia_html_test/0_41/tt4570696.html
new file mode 100644
index 000000000..3a3c752fd
--- /dev/null
+++ b/data/processed/wikipedia_html_test/0_41/tt4570696.html
@@ -0,0 +1,129 @@
+0-41*
+
+
+
+
+

0-41*

+
+ +
+
+
+
+

+

+
0–41*
Directed by: Senna Hegde
Written by: Senna Hegde
Produced by: Preetham Hegde, Chetan Shetty, Prasanna Hegde
Starring: Rajesh Thoyammal, Vipin Kavvai, Abhilash Thoyammal, Priyadath TK, Sanal Manu
Cinematography: Keertan Poojary
Edited by: Syam Krishnan
Music by: Senna Hegde, Unni Abhijith, Glady Abraham, Subha Naidu
Production company: Marley State of Mind
Release date: 2016
Running time: 91 minutes
Country: India
Language: Malayalam
+

0–41* is an Indian Malayalam-language feature documentary by director Senna Hegde that examines young volleyball players and their lives in rural Kanhangad, India.[1]

The film had its world premiere at the 11th Cinema on the Bayou Film Festival in Lafayette, US.[2]

+ +

Plot

+

A group of local youth in a small town in India are closely knit by a game of volleyball every evening. Rajesh and Vipin lead the two teams with a great deal of passion until a seemingly endless losing streak sets Rajesh and his team on a trail of disbelief and dejection. The docudrama proceeds to draw a subtle parallel between the losing streak and the lives of these youth. Over the course of six days, the film brings into sharp focus their life stories, their aspirations and expectations, their faith and fears, and their view of the rest of the world, with the rural way of life in India providing a vivid backdrop.[3]

+

Cast

+
  • Rajesh Thoyammal
  • Vipin Kavvai
  • Abhilash Thoyammal
  • Priyadath TK
  • Sanal Manu
  • Ebi Ganesh
  • Vishnu Lakshmanan
  • Sunil Thoyammal
  • Ambu
+

Film festivals

+
  • Official Selection at the 11th Cinema on the Bayou Film Festival,[4] Lafayette, US.
  • Official Selection at the 1st RapidLion, the South African International Film Festival,[5] Johannesburg, SA.
  • Official Selection at the 3rd Noida International Film Festival,[6] New Delhi, India.
  • Official Selection at the Miami Independent Film Festival,[7] Miami, US.
  • Official Selection at the Newark International Film Festival,[8] Newark, US.
+

Awards

+
  • Winner – Best Cinematography, 3rd Noida International Film Festival,[9] 7 February 2016, New Delhi, India.
  • Nominated – Best Director, Newark International Film Festival,[10] 11 September 2016, Newark, US.
+

References

+
+
    +
  1. ^ "A tale from Kanhangad reaches Bayou film festival". The Times of India. 20 January 2016. Retrieved 2 February 2016.
  2. ^ "0–41: A Film About Reality Shines in Bayou Festival Platform – UT TV". Uttv.in. Retrieved 2 February 2016.
  3. ^ "Indulge – Kochi". The New Indian Express. Retrieved 2 February 2016.
  4. ^ "Films – PublicView". Cinemaonthebayou.com. Retrieved 2 February 2016.
  5. ^ "Game of Life". The New Indian Express. Archived from the original on 29 January 2016. Retrieved 2 February 2016.
  6. ^ "Noida International Film Festival". Miniboxoffice.com. Retrieved 2 February 2016.
  7. ^ "Miami Independent Film Festival". Miami Indie Fest. Retrieved 2 July 2016.
  8. ^ "Newark International Film Festival". Newark International Film Festival. Retrieved 2 August 2016.
  9. ^ "Feature Films & Feature Documentary Result: 3rd NIFF-16 - Results". Archived from the original on 13 August 2016.
  10. ^ "NIFF 2016 Winners". 15 September 2016.
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/data/processed/wikipedia_html_test/0_5_mm/0.5_mm_poster.jpeg b/data/processed/wikipedia_html_test/0_5_mm/0.5_mm_poster.jpeg
new file mode 100644
index 000000000..2e9e98ce5
Binary files /dev/null and b/data/processed/wikipedia_html_test/0_5_mm/0.5_mm_poster.jpeg differ
diff --git a/data/processed/wikipedia_html_test/0_5_mm/OOjs_UI_icon_edit-ltr-progressive.svg.png b/data/processed/wikipedia_html_test/0_5_mm/OOjs_UI_icon_edit-ltr-progressive.svg.png
new file mode 100644
index 000000000..62aec69e5
Binary files /dev/null and b/data/processed/wikipedia_html_test/0_5_mm/OOjs_UI_icon_edit-ltr-progressive.svg.png differ
diff --git a/data/processed/wikipedia_html_test/0_5_mm/tt3825360.html b/data/processed/wikipedia_html_test/0_5_mm/tt3825360.html
new file mode 100644
index 000000000..b8c708ecd
--- /dev/null
+++ b/data/processed/wikipedia_html_test/0_5_mm/tt3825360.html
@@ -0,0 +1,194 @@
+0.5 mm
+
+
+
+
+

0.5 mm

+
+ +
+
+
+
+
0.5 mm
Poster
0.5ミリ
Directed by: Momoko Andō
Screenplay by: Momoko Andō
Based on: 0.5 mm by Momoko Andō
Cinematography: Takahiro Haibara
Release date: November 8, 2014
Running time: 3 hours 18 minutes
Country: Japan
Language: Japanese
+

0.5 mm (0.5ミリ) is a 2014 Japanese drama film written and directed by novelist and filmmaker Momoko Andō. It was released in Japan on November 8, 2014.[1]

+ +

Cast

+ +

Reception

+

Critical response

+

Maggie Lee of Variety called the film "a work of both cool precision and endearing eccentricity".[2] +

+

Accolades

+

At the 36th Yokohama Film Festival, the film was chosen as the 3rd best Japanese film of the year[3] and Momoko Andō won the award for Best Director.[4] +

At the 39th Hochi Film Awards, the film won the award for Best Picture and Masahiko Tsugawa won the award for Best Supporting Actor.[5] +

At the 69th Mainichi Film Awards, Momoko Andō won the award for Best Screenplay and Sakura Ando won the award for Best Actress.[6] +

+

References

+
+
    +
  1. ^ 0.5ミリ(2013). allcinema (in Japanese). Stingray. Archived from the original on September 26, 2015. Retrieved June 15, 2015.
  2. ^ Maggie Lee (June 9, 2015). "Film Review: '0.5 mm'". variety.com. Archived from the original on June 10, 2015. Retrieved June 15, 2015.
  3. ^ "2014年日本映画ベストテン". homepage3.nifty.com/yokohama-eigasai (in Japanese). Archived from the original on June 24, 2015. Retrieved June 15, 2015.
  4. ^ "第36回ヨコハマ映画祭 2014年日本映画個人賞". homepage3.nifty.com/yokohama-eigasai (in Japanese). Archived from the original on September 26, 2015. Retrieved June 15, 2015.
  5. ^ 第39回報知映画賞受賞一覧. hochi.co.jp (in Japanese). Archived from the original on December 4, 2014. Retrieved June 15, 2015.
  6. ^ "69th (2014年)". mainichi.jp (in Japanese). Archived from the original on January 21, 2015. Retrieved June 15, 2015.
+
+ +
+
+
+
+
+
+
\ No newline at end of file
diff --git a/scripts/dataset_create.py b/scripts/dataset_create.py
index 4511228d2..f845d79ce 100644
--- a/scripts/dataset_create.py
+++ b/scripts/dataset_create.py
@@ -3,22 +3,44 @@ import os
 from scrape import extract_movie_info
 
 script_dir = os.path.dirname(os.path.abspath(__file__))
-file_path = os.path.join(script_dir, "..", "sample_data.xlsx")
-movie_data = pd.read_excel(file_path)
-print(movie_data.columns)
+# file_path = os.path.join(script_dir, "..", "sample_data.xlsx")
+# movie_data = pd.read_excel(file_path)
+# print(movie_data.columns)
 
-script_dir = os.path.dirname(os.path.abspath(__file__))
-movie_html = os.path.join(script_dir, "..", "data", "tt0074888.html")
+BASE_DIR = os.path.dirname(os.path.abspath(__file__))
+INPUT_DIR = r'C:\Users\Prabhaav\Projects\PyCharm\datamining_881\data\processed\wikipedia_html'
+SPREADSHEET_DIR = os.path.join(BASE_DIR, "../data/processed/spreadsheets/")
 
-title, directed_by, cast, genre, plot = extract_movie_info(movie_html)
-new_row = {
-    "Movie": title,
-    "Director": directed_by,
-    "Cast": ", ".join(cast),
-    "Genre": genre,
-    "Plot": plot
-}
+movie_data = pd.DataFrame(columns=['Title', 'Director', 'Cast', 'Genre', 'Plot', 'Release Date', 'Slug', 'Poster Filename'])
 
-movie_data.loc[len(movie_data)] = new_row
-output_path = os.path.join(script_dir, "..", "updated_data.xlsx")
+for folder in os.listdir(INPUT_DIR):
+    path = os.path.join(INPUT_DIR, folder)
+    script_dir = os.path.join(path, next((f for f in os.listdir(path) if f.endswith(".html")), None))
+    if not script_dir:
+        continue
+    try:
+        print(script_dir)
+        title, directed_by, cast, genre, plot, year, poster_filename = extract_movie_info(script_dir)
+        new_row = {
+            "Title": title,
+            "Director": directed_by,
+            "Cast": ", ".join(cast),
+            "Genre": genre,
+            "Plot": plot,
+            "Release Date": year,
+            "Slug": script_dir,
+            "Poster Filename": poster_filename
+        }
+        movie_data.loc[len(movie_data)] = new_row
+
+    except Exception as e:
+        print("error:", e)
+    except KeyboardInterrupt:
+        output_path = os.path.join(SPREADSHEET_DIR, "updated_data.xlsx")
+        print(output_path)
+        movie_data.to_excel(output_path, index=False)
+        quit()
+
+output_path = os.path.join(SPREADSHEET_DIR, "updated_data.xlsx")
+print(output_path)
 movie_data.to_excel(output_path, index=False)
\ No newline at end of file
diff --git a/scripts/scrape.py b/scripts/scrape.py
index a5fbadf8f..d8e782b75 100644
--- a/scripts/scrape.py
+++ b/scripts/scrape.py
@@ -1,8 +1,8 @@
 from bs4 import BeautifulSoup
 import os
 
-script_dir = os.path.dirname(os.path.abspath(__file__))
-file_path = os.path.join(script_dir, "..", "data", "tt0074888.html")
+# script_dir = os.path.dirname(os.path.abspath(__file__))
+# file_path = os.path.join(script_dir, "..", "data", "tt0074888.html")
 
 
 def extract_movie_info(file_path):
@@ -35,9 +35,14 @@ def extract_movie_info(file_path):
 
     directed_by = None
     cast = []
+    poster_filename = None
+    year = None
 
     if infobox:
         rows = infobox.find_all("tr")
+        img = infobox.select_one("img")
+        if img and img.get("src"):
+            poster_filename = os.path.basename(img["src"])
 
         for row in rows:
             header = row.find("th")
@@ -50,7 +55,8 @@ def extract_movie_info(file_path):
 
             if header_text == "Directed by":
                 directed_by = data.get_text(" ", strip=True)
-
+            elif "Release date" in header_text:
+                year = data.get_text(" ", strip=True)
             elif header_text == "Starring":
                 cast_items = list(data.stripped_strings)
                 cast = cast_items[:5]
@@ -73,14 +79,14 @@ def extract_movie_info(file_path):
 
     plot = plot.strip()
 
-    return title, directed_by, cast, genre, plot
+    return title, directed_by, cast, genre, plot, year, poster_filename
 
-# -----------------------------
-# Print results
-# -----------------------------
-title, directed_by, cast, genre, plot = extract_movie_info(file_path)
-print("Title:", title)
-print("Directed by:", directed_by)
-print("Cast:", cast)
-print("Genre:", genre)
-print("\nPlot:\n", plot)
\ No newline at end of file
+# # -----------------------------
+# # Print results
+# # -----------------------------
+# title, directed_by, cast, genre, plot = extract_movie_info(file_path)
+# print("Title:", title)
+# print("Directed by:", directed_by)
+# print("Cast:", cast)
+# print("Genre:", genre)
+# print("\nPlot:\n", plot)
\ No newline at end of file
diff --git a/updated_data.xlsx b/updated_data.xlsx
deleted file mode 100644
index fe854a26d..000000000
Binary files a/updated_data.xlsx and /dev/null differ
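
Note on the folder scan in `scripts/dataset_create.py`: the result of `next(..., None)` is passed straight to `os.path.join`, which raises `TypeError` when a folder contains no `.html` file, so the `if not script_dir` check never gets a chance to skip it. The call also sits outside the `try`, so one such folder would stop the whole run. A minimal sketch of a safer scan follows, using only the standard library. The helper names `find_html_file` and `iter_movie_pages` are hypothetical and not part of this repository, and the sorted iteration order is an assumption made for reproducibility.

```python
import os
import tempfile


def find_html_file(folder_path):
    """Return the path of the first .html file in folder_path, or None.

    Hypothetical helper: also returns None when folder_path is not a
    directory, so stray files inside the input directory are skipped.
    """
    if not os.path.isdir(folder_path):
        return None
    for name in sorted(os.listdir(folder_path)):
        if name.endswith(".html"):
            return os.path.join(folder_path, name)
    return None


def iter_movie_pages(input_dir):
    """Yield (folder_name, html_path) for each subfolder holding an HTML page."""
    for folder in sorted(os.listdir(input_dir)):
        html_path = find_html_file(os.path.join(input_dir, folder))
        if html_path is not None:
            yield folder, html_path


if __name__ == "__main__":
    # Throwaway directory tree that mimics the fixture layout in this diff.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "0_0_mhz"))
        os.makedirs(os.path.join(root, "no_html_here"))
        open(os.path.join(root, "0_0_mhz", "tt10341248.html"), "w").close()
        for folder, html_path in iter_movie_pages(root):
            print(folder, os.path.basename(html_path))
```

With a generator like this, the main loop could iterate over `iter_movie_pages(INPUT_DIR)` and keep only the `extract_movie_info` call inside the `try`. Building `INPUT_DIR` from `BASE_DIR`, the way `SPREADSHEET_DIR` is built, would also remove the machine-specific absolute path.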