# feat: new course kick off: Scraping with Apify and AI #2275

**Draft** — honzajavorek wants to merge 33 commits into `master` from `honzajavorek/ai-course`.
Commits (33):

- `c11bbb5` re-number existing sections
- `304f856` kick off a stub of the AI course
- `5872e7b` start working on the first lesson
- `f7c765c` add todo marks
- `29a9b36` few edits
- `81a9d8b` continue with the first lesson
- `8606ccf` remove line numbers
- `e1d9924` rename the course
- `23c644d` rename lessons
- `c446776` lure Vale into ignoring crawlee.dev
- `dca30d3` rename lesson files
- `403cab5` make Vale happy
- `3ca3f57` ooops
- `c9355f2` rework the installation
- `1b6aa51` finish the first lesson
- `98da907` make markdownlint happy
- `34cdc3d` better writing
- `7333938` we, not you
- `b20480a` improve the Apify paragraph
- `e40444f` re-number lessons
- `7e688a6` in progress first lesson
- `8853170` wrap up the draft of the first lesson
- `26a5327` make Vale happier
- `8ca6168` fix language and other improvements
- `1c1ef4c` make Vale happier
- `a90132c` repurpose the lesson to agentic development
- `b1a9040` language improvements
- `9b5265c` better wording
- `ccc9889` change the new actor flow
- `312e267` better grammar and flow
- `f407b21` polish the very intro to the course
- `9791d4c` refine info about back-and-forth between ChatGPT and Web IDE
- `7baa332` add admonition about why ChatGPT
Vale accept list (one line added, adding `crawlee.dev` per the commit "lure Vale into ignoring crawlee.dev"):

```diff
@@ -1,6 +1,7 @@
 SDK(s)
 [Ss]torages
 Crawlee
+crawlee.dev
 [Aa]utoscaling
 CU
```
`...es/academy/platform/scraping_with_apify_and_ai/01_developing_scraper_ai_chat.md` (183 additions, 0 deletions):
---
title: Developing a scraper with AI chat
description: TBD
slug: /scraping-with-apify-and-ai/developing-scraper-with-ai-chat
unlisted: true
---

**In this lesson, we'll use ChatGPT and the Apify platform to create an app for tracking prices on an e-commerce website.**

---

Want to extract data from a website? Even without knowing how to code, we can open [ChatGPT](https://chatgpt.com/) and have a scraper ready. Let's say we want to track prices from [this Sales page](https://warehouse-theme-metal.myshopify.com/collections/sales). We'd type something like:

```text
Create a scraper in JavaScript which downloads
https://warehouse-theme-metal.myshopify.com/collections/sales,
extracts all the products in Sales and saves a CSV file,
which contains:

- Product name
- Product detail page URL
- Price
```
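For illustration, the CSV-saving half of such a generated program might look like the sketch below. This is only one plausible shape, not what ChatGPT will necessarily produce; the HTML-parsing half usually relies on a library such as cheerio and is omitted here, so the product data is hard-coded as an example:

```javascript
import { writeFileSync } from 'node:fs';

// Quote a value so commas and quotes inside it don't break the CSV.
const toCsvValue = (value) => `"${String(value).replaceAll('"', '""')}"`;

// Turn an array of product objects into CSV text with a header row.
const toCsv = (products) => {
  const header = ['Product name', 'Product detail page URL', 'Price'];
  const rows = products.map((p) =>
    [p.name, p.url, p.price].map(toCsvValue).join(','),
  );
  return [header.map(toCsvValue).join(','), ...rows].join('\n');
};

// Example data, as it could come out of the HTML-parsing step.
const products = [
  { name: 'JBL Flip 4', url: 'https://example.com/jbl-flip-4', price: '$74.95' },
];

writeFileSync('products.csv', toCsv(products));
```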
Try it! The generated code will most likely work out of the box, but the resulting program will still have a few caveats. Some are usability issues:

- _User-operated:_ We have to run the scraper ourselves. If we're tracking price trends, we'd need to remember to run it daily. If we want, for example, alerts for big discounts, manually running the program isn't much better than just checking the site in a browser every day.
- _Manual data management:_ Tracking prices over time means figuring out how to organize the exported data ourselves. Processing the data could also be tricky, since different analysis tools often require different formats.

Some are technical challenges:

- _No monitoring:_ Even if we knew how to set up a server or home installation to run our scraper regularly, we'd have little insight into whether it ran successfully, what errors or warnings occurred, how long it took, or what resources it used.
- _Anti-scraping risks:_ If the target website detects our scraper, they can rate-limit or block us. Sure, we could run it from a coffee shop's Wi-Fi, but eventually they'd block that too, and we'd seriously annoy our barista.

To overcome these limitations, we'll use [Apify](https://apify.com/), a platform where our scraper can run independently of our computer.

:::info Why ChatGPT

We use OpenAI ChatGPT in this course only because it's the most widely used AI chat. Any similar tool, such as Google Gemini or Anthropic Claude, will do.

:::

## Creating an Apify account

First, let's [create a new Apify account](https://console.apify.com/sign-up). The signup flow takes us through a few checks to confirm that we're human and that our email is valid. It's annoying, but necessary to prevent abuse of the platform.

Once we have an active account, we can start working on our scraper. Using the platform's resources costs money, but worry not: everything we cover here fits within [Apify's free tier](https://apify.com/pricing).

## Creating a new Actor

Your phone runs apps; Apify runs Actors. If we want Apify to run something for us, it must be wrapped in the Actor structure. Conveniently, the platform provides ready-made templates we can use.

After login, we land on a page called **Apify Store**. Apify serves both as infrastructure where we can privately deploy and run our own scrapers, and as a marketplace where anyone can offer ready-made scrapers to others for rent. But let's hold off on exploring Apify Store for now. We'll navigate to **My Actors** under the **Development** menu:



Apify supports several ways to start a new project. In **My Actors**, we'll click **Use template**:



This opens the template selection screen. There are several templates to choose from, each for a different programming language or use case. We'll pick the first template, **Crawlee + Cheerio**. It has a yellow logo with the letters **JS**, which stands for JavaScript. That's the programming language our scraper will be written in:



This opens a preview of the template, where we'll confirm our choice:



And just like that, we have our first Actor! It's only a sample scraper that walks through a website and extracts page titles, but it's something we can already run, and it'll work.

## Running the sample Actor

The Actor's detail page has plenty of tabs and settings, but for now we'll stay in **Source > Code**. That's where the **Web IDE** is.

IDE stands for _integrated development environment_. Fear not, it's just jargon for ‘an app for editing code, somewhat comfortably’. In the Web IDE, we can browse the files the Actor is made of and change their contents.



But for now, we'll hold off on changing anything. First, let's check that the Actor works. We'll hit the **Build** button, which tells the platform to take all the Actor's files and prepare the program so we can run it.

The _build_ takes approximately a minute to finish. When it's done, the button becomes a **Start** button. Finally, we're ready. Let's press it!

The scraper starts running, and after another short wait, the first rows appear in the output table.



In the end, we should get around 100 results, which we can immediately export to several formats suitable for data analysis, including formats that MS Excel or Google Sheets can open.
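As an aside for the curious, exports don't have to go through the UI: the Apify API serves a dataset's items at a predictable URL. The sketch below builds such a URL; the endpoint shape and the `format` query parameter are assumptions worth double-checking against the Apify API reference, and `YOUR_DATASET_ID` is a placeholder:

```javascript
// Build the Apify API URL for downloading a dataset's items.
// Assumes the v2 "Get items" endpoint with a `format` query parameter
// (csv, json, xlsx, ...) as described in the Apify API reference.
const datasetItemsUrl = (datasetId, format = 'csv') =>
  `https://api.apify.com/v2/datasets/${datasetId}/items?format=${format}`;

// With Node 18+, the export could then be fetched directly:
// const response = await fetch(datasetItemsUrl('YOUR_DATASET_ID'));
// const csv = await response.text();

console.log(datasetItemsUrl('YOUR_DATASET_ID'));
```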
## Modifying the code with ChatGPT

Of course, we don't want page titles. We want a scraper that tracks e-commerce prices. Let's prompt ChatGPT to change the code so that it scrapes the [Sales page](https://warehouse-theme-metal.myshopify.com/collections/sales).

:::info The Warehouse store

In this course, we'll scrape a real e-commerce site instead of artificial playgrounds or sandboxes. Shopify, a major e-commerce platform, has a demo store at [warehouse-theme-metal.myshopify.com](https://warehouse-theme-metal.myshopify.com/). It strikes a good balance between being realistic and stable enough for a tutorial.

:::

First, let's navigate through the tabs to **Source > Input**, where we can change what the Actor takes as input. The sample scraper walks through whatever website we give it in the **Start URLs** field. We'll change it to this URL:

```text
https://warehouse-theme-metal.myshopify.com/collections/sales
```



Now let's go back to **Source > Code** so we can work with the Web IDE. We'll select the file called `routes.js` inside the `src` directory. We'll see code similar to this:

```js
import { createCheerioRouter } from '@crawlee/cheerio';

export const router = createCheerioRouter();

router.addDefaultHandler(async ({ enqueueLinks, request, $, log, pushData }) => {
    log.info('enqueueing new URLs');
    await enqueueLinks();

    // Extract title from the page.
    const title = $('title').text();
    log.info(`${title}`, { url: request.loadedUrl });

    // Save url and title to Dataset - a table-like storage.
    await pushData({ url: request.loadedUrl, title });
});
```

We'll select all the code and copy it to our clipboard. Then we'll switch to [ChatGPT](https://chatgpt.com/), open a **New chat**, and start with a prompt like this:

```text
I'm building an Apify Actor that will run on the Apify platform.
I need to modify a sample template project so it downloads
https://warehouse-theme-metal.myshopify.com/collections/sales,
extracts all products in Sales, and returns data with
the following information for each product:

- Product name
- Product detail page URL
- Price

Before the program ends, it should log how many products it collected.
Code from routes.js follows. Reply with a code block containing
a new version of that file.
```

We'll use <kbd>Shift+↵</kbd> to add a few empty lines, then paste the code from our clipboard. After we submit it, ChatGPT should return a large code block with a new version of `routes.js`. We'll copy it, switch back to the Web IDE, and replace the original `routes.js` content. That's it, we're ready to roll!

## Scraping products

Now let's see if the new code works. The button we previously used for building and running has conveniently become a **Save, Build & Start** button, so let's press it and see what happens. In a minute or so, we should see results appearing in the output area.



At this point, we haven't told the platform much about the data we expect, so the **Overview** pane lists only product URLs. But if we go to **All fields**, we'll see that it really scraped everything we asked for:

| name | url | price |
| --- | --- | --- |
| JBL Flip 4 Waterproof Portable Bluetooth Speaker | https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker | Sale price$74.95 |
| Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv | Sale priceFrom $1,398.00 |
| Sony SACS9 10" Active Subwoofer | https://warehouse-theme-metal.myshopify.com/products/sony-sacs9-10-inch-active-subwoofer | Sale price$158.00 |

…and so on. Looks good!

Well, does it? If we look closely, the prices include extra text, which isn't ideal. We'll improve this in the next lesson.
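To preview what such a cleanup might involve, a small helper could strip the label and parse the number. This is only a sketch, not necessarily the approach a later lesson takes:

```javascript
// Turn a scraped price string like "Sale price$74.95" or
// "Sale priceFrom $1,398.00" into a plain number, or null
// if no dollar amount is found.
const parsePrice = (text) => {
  const match = text.match(/\$([\d,]+\.\d{2})/);
  if (!match) return null;
  return Number(match[1].replaceAll(',', ''));
};

console.log(parsePrice('Sale price$74.95'));         // 74.95
console.log(parsePrice('Sale priceFrom $1,398.00')); // 1398
```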
:::tip If output doesn't appear

If the scraper doesn't produce any rows, make sure you changed the input URL and applied all the code changes.

If that doesn't help, check the **Log** next to **Output**. You can copy the whole log, paste it into ChatGPT, and let it figure out what went wrong.

If you're still stuck, open a clean new chat in ChatGPT and try the same prompt for `routes.js` again.

:::

## Wrapping up

Despite a few flaws, we've successfully created a first working prototype of a price-watching app with no coding knowledge.

And thanks to Apify, our scraper can [run automatically on a weekly basis](https://docs.apify.com/platform/schedules), we have its output [ready to download in a variety of formats](https://docs.apify.com/platform/storage/dataset), we can [monitor its runs](https://docs.apify.com/platform/monitoring), and we can [work around anti-scraping measures](https://docs.apify.com/platform/proxy).
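To make "weekly" concrete: Apify schedules are defined with cron expressions (see the linked schedules documentation), so a weekly run could be expressed like this. The particular day and hour here are our own arbitrary choice:

```text
0 8 * * 1    run the Actor every Monday at 08:00
```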
To improve our project further, we copy the code, ask ChatGPT to refine it, paste it back into the Web IDE, and rebuild.

Sounds tedious? In the next lesson, we'll look at how to get the Actor code onto our computer and use the Cursor IDE, with its built-in AI agent, instead of the Web IDE, so we can develop our scraper faster and with less back-and-forth.