diff --git a/.github/styles/config/vocabularies/Docs/accept.txt b/.github/styles/config/vocabularies/Docs/accept.txt index dfb5437af3..3ae17f3011 100644 --- a/.github/styles/config/vocabularies/Docs/accept.txt +++ b/.github/styles/config/vocabularies/Docs/accept.txt @@ -1,6 +1,7 @@ SDK(s) [Ss]torages Crawlee +crawlee.dev [Aa]utoscaling CU diff --git a/sources/academy/platform/apify_platform.md b/sources/academy/platform/apify_platform.md index 8b56843984..5defcbea90 100644 --- a/sources/academy/platform/apify_platform.md +++ b/sources/academy/platform/apify_platform.md @@ -1,7 +1,7 @@ --- title: Introduction to the Apify platform description: Learn all about the Apify platform, all of the tools it offers, and how it can improve your overall development experience. -sidebar_position: 7 +sidebar_position: 1 category: apify platform slug: /apify-platform --- diff --git a/sources/academy/platform/deploying_your_code/index.md b/sources/academy/platform/deploying_your_code/index.md index bd67d39e27..ad79703b51 100644 --- a/sources/academy/platform/deploying_your_code/index.md +++ b/sources/academy/platform/deploying_your_code/index.md @@ -1,7 +1,7 @@ --- title: Deploying your code to Apify description: In this course learn how to take an existing project of yours and deploy it to the Apify platform as an Actor. -sidebar_position: 9 +sidebar_position: 3 category: apify platform slug: /deploying-your-code --- diff --git a/sources/academy/platform/expert_scraping_with_apify/index.md b/sources/academy/platform/expert_scraping_with_apify/index.md index 040dbcf6c8..3298f08a4d 100644 --- a/sources/academy/platform/expert_scraping_with_apify/index.md +++ b/sources/academy/platform/expert_scraping_with_apify/index.md @@ -1,7 +1,7 @@ --- title: Expert scraping with Apify description: After learning the basics of Actors and Apify, learn to develop pro-level scrapers on the Apify platform with this advanced course. 
-sidebar_position: 13 +sidebar_position: 6 category: apify platform slug: /expert-scraping-with-apify --- diff --git a/sources/academy/platform/getting_started/index.md b/sources/academy/platform/getting_started/index.md index 6c0f744232..2541b76711 100644 --- a/sources/academy/platform/getting_started/index.md +++ b/sources/academy/platform/getting_started/index.md @@ -1,7 +1,7 @@ --- title: Getting started description: Get started with the Apify platform by creating an account and learning about Apify Console, which is where all Apify Actors are born! -sidebar_position: 8 +sidebar_position: 2 category: apify platform slug: /getting-started --- diff --git a/sources/academy/platform/scraping_with_apify_and_ai/01_developing_scraper_ai_chat.md b/sources/academy/platform/scraping_with_apify_and_ai/01_developing_scraper_ai_chat.md new file mode 100644 index 0000000000..ce5f9e1000 --- /dev/null +++ b/sources/academy/platform/scraping_with_apify_and_ai/01_developing_scraper_ai_chat.md @@ -0,0 +1,183 @@ +--- +title: Developing a scraper with AI chat +description: TBD +slug: /scraping-with-apify-and-ai/developing-scraper-with-ai-chat +unlisted: true +--- + +**In this lesson, we'll use ChatGPT and the Apify platform to create an app for tracking prices on an e-commerce website.** + +--- + +Want to extract data from a website? Even without knowing how to code, we can open [ChatGPT](https://chatgpt.com/) and have a scraper ready. Let's say you want to track prices from [this Sales page](https://warehouse-theme-metal.myshopify.com/collections/sales). You'd type something like: + +```text +Create a scraper in JavaScript which downloads +https://warehouse-theme-metal.myshopify.com/collections/sales, +extracts all the products in Sales and saves a CSV file, +which contains: + +- Product name +- Product detail page URL +- Price +``` + +Try it! The generated code will most likely work out of the box, but the resulting program will still have a few caveats. 
Some are usability issues: + +- _User-operated:_ We have to run the scraper ourselves. If we're tracking price trends, we'd need to remember to run it daily. If we want, for example, alerts for big discounts, manually running the program isn't much better than just checking the site in a browser every day. +- _Manual data management:_ Tracking prices over time means figuring out how to organize the exported data ourselves. Processing the data could also be tricky since different analysis tools often require different formats. + +Some are technical challenges: + +- _No monitoring:_ Even if we knew how to set up a server or a home machine to run our scraper regularly, we'd have little insight into whether it ran successfully, what errors or warnings occurred, how long it took, or what resources it used. +- _Anti-scraping risks:_ If the target website detects our scraper, they can rate-limit or block us. Sure, we could run it from a coffee shop's Wi-Fi, but eventually they'd block that too, and we'd seriously annoy our barista. + +To overcome these limitations, we'll use [Apify](https://apify.com/), a platform where our scraper can run independently of our computer. + +:::info Why ChatGPT + +We use OpenAI ChatGPT in this course only because it's the most widely used AI chat. Any similar tool, such as Google Gemini or Anthropic Claude, will do. + +::: + +## Creating an Apify account + +First, let's [create a new Apify account](https://console.apify.com/sign-up). The signup flow takes us through a few checks to confirm we're human and that our email is valid. It's annoying, but necessary to prevent abuse of the platform. + +Once we have an active account, we can start working on our scraper. Using the platform's resources costs money, but worry not: everything we cover here fits within [Apify's free tier](https://apify.com/pricing). + +## Creating a new Actor + +Your phone runs apps, Apify runs Actors.
If we want Apify to run something for us, it must be wrapped in the Actor structure. Conveniently, the platform provides ready-made templates we can use. + +After login, we land on a page called **Apify Store**. Apify serves both as infrastructure where we can privately deploy and run our own scrapers, and as a marketplace where anyone can offer ready-made scrapers to others for rent. But let's hold off on exploring Apify Store for now. We'll navigate to **My Actors** under the **Development** menu: + +![Apify Store welcome screen with Development menu highlighted](images/apify-nav-store.webp) + +Apify supports several ways to start a new project. In **My Actors**, we'll click **Use template**: + +![My Actors page with Use template button](images/apify-nav-my-actors.webp) + +This opens the template selection screen. There are several templates to choose from, each for a different programming language or use case. We'll pick the first template, **Crawlee + Cheerio**. It has a yellow logo with the letters **JS**, which stands for JavaScript. That's the programming language our scraper will be written in: + +![Template selection screen with Crawlee + Cheerio highlighted](images/apify-nav-templates.webp) + +This opens a preview of the template, where we'll confirm our choice: + +![Template preview screen with Use template button](images/apify-nav-template.webp) + +And just like that, we have our first Actor! It's only a sample scraper that walks through a website and extracts page titles, but it's something we can already run, and it'll work. + +## Running the sample Actor + +The Actor's detail page has plenty of tabs and settings, but for now we'll stay at **Source** → **Code**. That's where the **Web IDE** is. + +IDE stands for _integrated development environment_. Fear not, it's just jargon for ‘an app for editing code, somewhat comfortably’. In the Web IDE, we can browse the files the Actor is made of, and change their contents.
+ +![Web IDE](images/apify-web-ide.webp) + +But for now, we'll hold off on changing anything. First, let's check that the Actor works. We'll hit the **Build** button, which tells the platform to take all the Actor files and prepare the program so we can run it. + +The _build_ takes approximately one minute to finish. When done, the button becomes a **Start** button. Finally, we are ready. Let's press it! + +The scraper starts running, and after another short wait, the first rows start to appear in the output table. + +![Sample Actor output](images/apify-output-sample.webp) + +In the end, we should get around 100 results, which we can immediately export to several formats suitable for data analysis, including those which MS Excel or Google Sheets can open. + +## Modifying the code with ChatGPT + +Of course, we don't want page titles. We want a scraper that tracks e-commerce prices. Let's prompt ChatGPT to change the code so that it scrapes the [Sales page](https://warehouse-theme-metal.myshopify.com/collections/sales). + +:::info The Warehouse store + +In this course, we'll scrape a real e-commerce site instead of artificial playgrounds or sandboxes. Shopify, a major e-commerce platform, has a demo store at [warehouse-theme-metal.myshopify.com](https://warehouse-theme-metal.myshopify.com/). It strikes a good balance between being realistic and stable enough for a tutorial. + +::: + +First, let's navigate through the tabs to **Source** → **Input**, where we can change what the Actor takes as input. The sample scraper walks through whatever website we give it in the **Start URLs** field. We'll change it to this URL: + +```text +https://warehouse-theme-metal.myshopify.com/collections/sales +``` + +![Actor input](images/apify-input.webp) + +Now let's go back to **Source** → **Code** so we can work with the Web IDE. We'll select a file called `routes.js` inside the `src` folder. 
We'll see code similar to this: + +```js +import { createCheerioRouter } from '@crawlee/cheerio'; + +export const router = createCheerioRouter(); + +router.addDefaultHandler(async ({ enqueueLinks, request, $, log, pushData }) => { +    log.info('enqueueing new URLs'); +    await enqueueLinks(); + +    // Extract title from the page. +    const title = $('title').text(); +    log.info(`${title}`, { url: request.loadedUrl }); + +    // Save url and title to Dataset - a table-like storage. +    await pushData({ url: request.loadedUrl, title }); +}); +``` + +We'll select all the code and copy it to our clipboard. Then we'll switch to [ChatGPT](https://chatgpt.com/), open **New chat** and start with a prompt like this: + +```text +I'm building an Apify Actor that will run on the Apify platform. +I need to modify a sample template project so it downloads +https://warehouse-theme-metal.myshopify.com/collections/sales, +extracts all products in Sales, and returns data with +the following information for each product: + +- Product name +- Product detail page URL +- Price + +Before the program ends, it should log how many products it collected. +Code from routes.js follows. Reply with a code block containing +a new version of that file. +``` + +We'll use Shift+↵ to add a few empty lines, then paste the code from our clipboard. After we submit it, ChatGPT should return a large code block with a new version of `routes.js`. We'll copy it, switch back to the Web IDE, and replace the original `routes.js` content. That's it, we're ready to roll! + +## Scraping products + +Now let's see if the new code works. The button we previously used for building and running conveniently became a **Save, Build & Start** button, so let's press it and see what happens. In a minute or so we should see the results appearing in the output area.
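By the way, the exact code ChatGPT returns varies from run to run, and we don't need to understand it. Still, for reference, a working version of the new `routes.js` might look roughly like this sketch. The CSS selectors are our guesses at the Warehouse theme's markup, not something ChatGPT is guaranteed to produce:

```js
import { createCheerioRouter } from '@crawlee/cheerio';

export const router = createCheerioRouter();

router.addDefaultHandler(async ({ request, $, log, pushData }) => {
    const products = [];

    // The selectors below are assumptions about the page's markup;
    // ChatGPT may well pick different ones.
    $('.product-item').each((_, element) => {
        const item = $(element);
        const link = item.find('a.product-item__title').first();
        products.push({
            name: link.text().trim(),
            // Resolve the relative product link against the page URL.
            url: new URL(link.attr('href'), request.loadedUrl).href,
            price: item.find('.price').text().trim(),
        });
    });

    log.info(`Collected ${products.length} products`);
    await pushData(products);
});
```

If your version looks different, that's fine, as long as it runs and produces the three fields we asked for.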
+ +![Warehouse scraper output](images/apify-output-warehouse.webp) + +At this point, we haven't told the platform much about the data we expect, so the **Overview** pane lists only product URLs. But if we go to **All fields**, we'll see that it really scraped everything we asked for: + +| name | url | price | +| --- | --- | --- | +| JBL Flip 4 Waterproof Portable Bluetooth Speaker | https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker | Sale price$74.95 | +| Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv | Sale priceFrom $1,398.00 | +| Sony SACS9 10" Active Subwoofer | https://warehouse-theme-metal.myshopify.com/products/sony-sacs9-10-inch-active-subwoofer | Sale price$158.00 | + +…and so on. Looks good! + +Well, does it? If we look closely, the prices include extra text, which isn't ideal. We'll improve this in the next lesson. + +:::tip If output doesn't appear + +If the scraper doesn't produce any rows, make sure you changed the input URL and applied all code changes. + +If that doesn't help, check the **Log** next to **Output**. You can copy the whole log, paste it into ChatGPT, and let it figure out what went wrong. + +If you're still stuck, open a clean new chat in ChatGPT and try the same prompt for `routes.js` again. + +::: + +## Wrapping up + +Despite a few flaws, we've successfully created a first working prototype of a price-watching app with no coding knowledge. + +And thanks to Apify, our scraper can [run automatically on a weekly basis](https://docs.apify.com/platform/schedules), we have its output [ready to download in a variety of formats](https://docs.apify.com/platform/storage/dataset), we can [monitor its runs](https://docs.apify.com/platform/monitoring), and we can [work around anti-scraping measures](https://docs.apify.com/platform/proxy). 
+ +To improve our project further, we copy the code, ask ChatGPT to refine it, paste it back into the Web IDE, and rebuild. + +Sounds tedious? In the next lesson, we'll take a look at how we can get the Actor code onto our computer and use the Cursor IDE with a built-in AI agent instead of the Web IDE, so we can develop our scraper faster and with less back-and-forth. diff --git a/sources/academy/platform/scraping_with_apify_and_ai/02_developing_scraper_ai_agent.md b/sources/academy/platform/scraping_with_apify_and_ai/02_developing_scraper_ai_agent.md new file mode 100644 index 0000000000..cf721be5d0 --- /dev/null +++ b/sources/academy/platform/scraping_with_apify_and_ai/02_developing_scraper_ai_agent.md @@ -0,0 +1,328 @@ +--- +title: Developing a scraper with AI agent +description: TBD +slug: /scraping-with-apify-and-ai/developing-scraper-with-ai-agent +unlisted: true +--- + +**In this lesson, we'll keep improving our app for tracking prices on an e-commerce website. We'll get its code onto our computer and use Cursor to streamline how we update our scraper.** + +--- + +In the previous lesson, modifying our scraper involved navigating through the Web IDE, copying code, switching to ChatGPT and back, pasting new code, and so on. + +That kind of grind is okay for small edits, but it's not sustainable in the long run. If we want to build something larger, or something robust that we can develop and maintain over time, we need to streamline the process. + +To step up our game, we'll run a few commands and install a few tools so we can bring the tools of the trade onto our computer: + +- _Local development:_ We'll have the Actor files downloaded and we'll be able to run the code locally. This makes it fast and easy to verify any changes. +- _Agentic coding:_ We'll have a locally installed IDE with a built-in AI agent that we can point at the Actor files. We'll be able to tell it what we need, and it'll change the files directly, without hand-holding. 
+- _Basic versioning:_ We'll be able to develop changes locally while the previous version of our code keeps running on the Apify platform undisturbed. Only once we're happy with what we have will we push the changes back, so they can replace the old version. + +We're getting one tiny step closer to becoming developers, but don't worry. It's not like we'll suddenly need to read code. + +## Installing Node.js + +If we want to run our scraper on our own computer, whether we do it ourselves or have our AI agent do it for us, we first need to set up the environment so the code can run locally. + +Previously we chose to develop our scraper in a mainstream programming language called JavaScript. To run command line programs written in JavaScript, we'll need a tool called Node.js. + +Let's head to the [Download Node.js](https://nodejs.org/en/download) page. We should see a row of configuration dropdowns and a fairly large code block below it, with quite a few commands. Let's check whether the page guessed our operating system correctly, then copy the whole block to the clipboard: + +![Download Node.js](images/nodejs-install.webp) + +Now let's paste it as-is into Terminal (macOS/Linux) or PowerShell (Windows) and run it by pressing Enter. Once the installation finishes, we should see the versions of Node.js and npm, another related tool, printed out: + +```text +... +$ node -v +v24.11.1 +$ npm -v +11.6.2 +``` + +The exact version numbers aren't very important. If we see them printed, we've successfully installed Node.js and npm. + +## Installing Apify CLI + +Now we'll need the Apify CLI. It's a command-line tool that works like a remote control for the Apify platform. It also happens to be written in JavaScript, so we can use the npm tool we just installed to get it onto our computer.
Let's run this command: + +```text +npm install -g apify-cli +``` + +Once the command finishes, let's check whether everything went right: + +```text +apify --version +``` + +If it prints something like this, we have the tool installed: + +```text +apify-cli/0.0.0 (1a2b3c4) running on ... with node-0.0.0, installed via ... +``` + +One more thing though. Before we can do any useful work with it, we also need to log in: + +```text +apify login +``` + +Let's confirm **Through Apify Console in your default browser** with Enter. The command line tool opens a web page in our browser, where we'll authorize it as a remote control for our Apify account. When we return to the command line, we should see the following success message: + +```text +Success: You are logged in to Apify as hjtest. +``` + +The message mentions our username, in this case `hjtest`. We'll remember it, as we'll need it for our next task. + +Awesome, now we're ready to remote control Apify from the command line! + +## Downloading Actor files + +Now that we've got hold of a handy remote control, let's use it to download the Actor files. In the following command, replace `hjtest` with your own username: + +```text +apify pull hjtest/my-actor +``` + +The following output should appear: + +```text +Success: Pulled to /.../my-actor/ +``` + +The tool created a new folder called `my-actor` and pulled all Actor files to it, so that we can work on them on our computer instead of the Web IDE. Let's run another command to move us into this new folder: + +```text +cd my-actor +``` + +Being inside the folder ensures that the commands we run next affect just this project, not any other folders on our disk. + +Now we've got the code of our Actor, but we already know from the previous lesson that Actors first need to be _built_ before they can be _run_.
Let's run the following command, which installs software our Actor depends on: + +```text +npm install +``` + +The command will flood us with output about what's being installed, perhaps with some warnings, recommendations, and so on. Unfortunately, it's hard to tell from all that noise whether the installation succeeded, but it's safe to assume success as long as nothing screams red with errors. + +:::tip If it doesn't install + +If the output does scream red with errors, or if later in the lesson you find out you're unable to run the Actor, copy the whole output of `npm install` and paste it to ChatGPT for help. + +::: + +## Running the Actor locally + +Now that we have the Actor available on our computer, does it work? Let's try! + +```text +apify run --input '{"startUrls": [{"url": "https://warehouse-theme-metal.myshopify.com/collections/sales"}]}' +``` + +Plain `apify run` isn't enough for now, because the Actor we made expects an input with a URL that it's supposed to scrape. Adding `--input` with the ball of special characters that follows is technically equivalent to what we previously did in Apify when changing the field on the **Input** tab. + +When the run is done, we should see an output similar to this one: + +```text +Info: All default local stores were purged. +Run: npm run start + +> crawlee-cheerio-javascript@0.0.1 start +> node src/main.js + +INFO System info {"apifyVersion":"3.7.0","apifyClientVersion":"2.22.3","crawleeVersion":"3.16.0","osType":"Darwin","nodeVersion":"v25.9.0"} +WARN ProxyConfiguration: The "Proxy external access" feature is not enabled for your account. Please upgrade your plan or contact support@apify.com +INFO CheerioCrawler: Starting the crawler. +INFO CheerioCrawler: Processing page: https://warehouse-theme-metal.myshopify.com/collections/sales +INFO CheerioCrawler: All requests from the queue have been processed, the crawler will shut down.
+INFO CheerioCrawler: Final request statistics: {"requestsFinished":1,"requestsFailed":0,"retryHistogram":[1],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":328,"requestsFinishedPerMinute":155,"requestsFailedPerMinute":0,"requestTotalDurationMillis":328,"requestsTotal":1,"crawlerRuntimeMillis":386} +INFO CheerioCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true} +``` + +Although we can't quite see what the scraped items look like, we can spot that our scraper made a single request to https://warehouse-theme-metal.myshopify.com/collections/sales and finished without crashing. For a start, let's call it a success! + +Now we could continue messing around with files and commands, but luckily, we don't have to. We now have everything in place to let an AI agent do the work for us. But do we have an agent? One last installation, pinky promise! + +## Installing Cursor + +Cursor is an IDE for browsing code, similar to Apify's Web IDE, but it's an app we install on our computer. Also, it's an IDE with a built-in AI agent, which will help us with all the coding. + +:::info Why Cursor + +We use Cursor in this course because it's the only one of the mainstream AI-first IDEs that offers a free plan. If you're willing to pay, any IDE with an AI agent would fare the same, be it GitHub Copilot in VS Code, Claude Code, or OpenAI Codex. + +::: + +Using Cursor's AI features requires an account, so let's create one. In the browser, let's open the [Sign Up page](https://authenticator.cursor.sh/sign-up) and create a new account in one of the standard ways. When asked to start a subscription, we'll select **Skip for now** to stay on the free plan. + +![Skip starting a paid plan](images/cursor-plan.webp) + +Similarly, when asked to connect GitHub, we'll choose **Maybe later**. Once we're all set, let's [download the app](https://cursor.com/download) and get it installed.
+ +![Download Cursor](images/cursor-install.webp) + +When we open the app for the first time, it requires a login. We'll click **Log In**, which will send us back to the browser. By choosing **Yes, Log In** we'll confirm that the app can use our account, and then we'll get back to the app. + +![Open project in Cursor](images/cursor-open.webp) + +Let's click on **Open project** and select the folder with our Actor. + +:::tip Locating the Actor folder + +If you struggle to find where the Actor folder is, run `pwd` in the command line, which prints a full path to the folder you're in. + +::: + +When Cursor opens the Actor's project folder, we'll see something similar to the following: + +![Cursor ready](images/cursor-ready.webp) + +We can select files, and if we do so, we can browse and modify their content, the same as in the Web IDE. But in addition, we now have an integrated AI agent that we can prompt, and it'll make whatever changes we need directly to the code at hand. + +Finally, onto some agentic coding! + +## Modifying code with Cursor + +First, let's simplify how we can run the Actor. This will be our prompt: + +```text +Change the default input URL of the Actor +to https://warehouse-theme-metal.myshopify.com/collections/sales +``` + +After we submit the prompt, the agent will start reading the code, planning, and working on completing the task. Before it runs commands, it'll ask us to approve them. + +![Cursor asking for approval to run a command](images/cursor-approve.webp) + +When done, it'll print a summary of its work and we'll be able to review all changes made.
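What did the agent actually change? Most likely the prefilled value in the Actor's input schema, which in this template lives in `.actor/input_schema.json`. A sketch of the relevant part follows; the real file has more fields, and the agent may have edited things differently:

```json
{
    "title": "Input schema",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "editor": "requestListSources",
            "prefill": [
                { "url": "https://warehouse-theme-metal.myshopify.com/collections/sales" }
            ]
        }
    }
}
```

We don't need to touch this file ourselves; it's just useful to know where such defaults live when reviewing the agent's work.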
+ +![Cursor asking for a review of changes](images/cursor-review.webp) + +We'll approve all changes and go to the command line to try out whether the Actor now works as expected: + +```text +apify run +``` + +We should see a scraper output like before, including the following line: + +```text +INFO CheerioCrawler: Processing page: https://warehouse-theme-metal.myshopify.com/collections/sales +``` + +That's our first successful change to the Actor with an AI agent, without back-and-forth between the IDE and an AI chat like ChatGPT. Now, before pushing this change back to Apify, let's make one more improvement to the scraper. + +## Scraping prices + +In the previous lesson, we noticed that the prices in our resulting dataset are in a rather raw shape: + +| name | url | price | +| --- | --- | --- | +| JBL Flip 4 Waterproof Portable Bluetooth Speaker | https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker | Sale price$74.95 | +| Sony XBR-950G BRAVIA 4K HDR Ultra HD TV | https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv | Sale priceFrom $1,398.00 | +| Sony SACS9 10" Active Subwoofer | https://warehouse-theme-metal.myshopify.com/products/sony-sacs9-10-inch-active-subwoofer | Sale price$158.00 | + +Let's change that. We'll prompt the agent like this, with a clear example of what we want: + +```text +Change the code so that the Actor saves prices as numbers. +Because some prices are "from", let's call the "price" field +"minPrice" instead, as in minimum price. Example follows. + +Before: +Sale price$74.95 +Sale priceFrom $1,398.00 +Sale price$158.00 + +After: +74.95 +1398.00 +158.00 +``` + +When the agent is done, we'll approve the changes and verify in the command line that the Actor runs locally: + +```text +apify run +``` + +It runs, that's nice! But looking at the output, we can't really verify what exactly gets scraped!
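Whatever exact code the agent wrote, the heart of such a change is usually a small parsing step that turns the raw label into a number. It might look something like this sketch (the function name is ours, not necessarily the agent's):

```js
// Turn a raw price label like "Sale priceFrom $1,398.00" into a number.
// A sketch of the kind of logic the agent might generate.
// Returns null when no dollar amount is found.
function parseMinPrice(text) {
    const match = text.replace(/,/g, '').match(/\$(\d+(?:\.\d+)?)/);
    return match ? Number(match[1]) : null;
}

console.log(parseMinPrice('Sale price$74.95')); // 74.95
console.log(parseMinPrice('Sale priceFrom $1,398.00')); // 1398
```

The good news: we don't have to read or write this ourselves; the agent handles it, and we only review the result.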
While we're at it, let's change that with another prompt: + +```text +In the output of the scraper I want to see +what the items being saved look like. +``` + +We'll approve all changes and go to the command line again: + +```text +apify run +``` + +Now the output of the scraper contains the actual items being scraped, and we can verify that we've successfully changed the format of the prices (they appear at the very end of each line): + +```text +... +INFO CheerioCrawler: Processing page: https://warehouse-theme-metal.myshopify.com/collections/sales +INFO CheerioCrawler: Saving dataset item {"name":"JBL Flip 4 Waterproof Portable Bluetooth Speaker","url":"https://warehouse-theme-metal.myshopify.com/products/jbl-flip-4-waterproof-portable-bluetooth-speaker","minPrice":74.95} +INFO CheerioCrawler: Saving dataset item {"name":"Sony XBR-950G BRAVIA 4K HDR Ultra HD TV","url":"https://warehouse-theme-metal.myshopify.com/products/sony-xbr-65x950g-65-class-64-5-diag-bravia-4k-hdr-ultra-hd-tv","minPrice":1398} +INFO CheerioCrawler: Saving dataset item {"name":"Sony SACS9 10\" Active Subwoofer","url":"https://warehouse-theme-metal.myshopify.com/products/sony-sacs9-10-inch-active-subwoofer","minPrice":158} +INFO CheerioCrawler: Saving dataset item {"name":"Sony PS-HX500 Hi-Res USB Turntable","url":"https://warehouse-theme-metal.myshopify.com/products/sony-ps-hx500-hi-res-usb-turntable","minPrice":398} +... +``` + +Now let's push the changes back to Apify, so that our scheduled scraping on the platform can benefit from the improvements we've made locally on our computer. + +:::tip Automatically approving changes + +If you grow tired of approvals, you can enable _auto-keep_. Go to **Cursor** → **Settings…** → **Cursor Settings** → **Agents** → **Applying Changes** and turn off **Inline Diffs**.
+ +::: + +## Pushing the Actor to Apify + +To replace the Actor files living on the Apify platform with the ones we have locally, we can run the following command: + +```text +apify push +``` + +The command can take a while to finish, because it also immediately triggers a build. Once it's done, the new version of the Actor is ready to be run. The output of the command ends with these two lines: + +```text +... +Actor detail https://console.apify.com/actors/EL7U7aNddXOzwEJ66 +Success: Actor was deployed to Apify cloud and built there. +``` + +We'll follow the link in our browser, and in the Apify interface we'll click the **Start** button. Soon we should see items popping up in the **Output** section. For a full overview, let's switch to **All fields** again: + +![Modified Apify output](images/apify-output-modified.webp) + +We've done it, the prices are saved as numbers! + +:::tip Specifying output schema + +If we don't want to always click on **All fields** to see full items, we need to specify an [output schema](https://docs.apify.com/platform/actors/development/actor-definition/output-schema) so that the platform knows what it can expect and how it should display it in the interface. With Cursor, such a change is just a single prompt away: + +```text +Change the output schema of the Actor +so that it represents the items being +saved the best way in the Apify interface. +``` + +::: + +## Wrapping up + +We've been installing and setting up a lot, but once we got our environment ready, we could reap the benefits of fast changes to our scraper. + +With a single prompt, we tackled a significant change in how our app stores the prices. And we still didn't need to know any coding. + +To improve our project further, we ask the agent to perform a change, review and approve its work, then execute `apify run` in the command line to verify that it works, and finally `apify push` to upload our Actor files to Apify.
+ +In the next lesson, we'll take a look at how we can develop our scraper by documenting how it should behave, instead of prompting the AI agent feature by feature without keeping any record of our intentions. diff --git a/sources/academy/platform/scraping_with_apify_and_ai/03_docs_driven_prompting.md b/sources/academy/platform/scraping_with_apify_and_ai/03_docs_driven_prompting.md new file mode 100644 index 0000000000..62957d551b --- /dev/null +++ b/sources/academy/platform/scraping_with_apify_and_ai/03_docs_driven_prompting.md @@ -0,0 +1,25 @@ +--- +title: Docs driven prompting +description: TBD +slug: /scraping-with-apify-and-ai/docs-driven-prompting +unlisted: true +--- + + + +:::note Course under construction +This page hasn't been written yet. Come back later, please! +::: + + + diff --git a/sources/academy/platform/scraping_with_apify_and_ai/04_tests_driven_prompting.md b/sources/academy/platform/scraping_with_apify_and_ai/04_tests_driven_prompting.md new file mode 100644 index 0000000000..69abc81fd8 --- /dev/null +++ b/sources/academy/platform/scraping_with_apify_and_ai/04_tests_driven_prompting.md @@ -0,0 +1,27 @@ +--- +title: Tests driven prompting +description: TBD +slug: /scraping-with-apify-and-ai/tests-driven-prompting +unlisted: true +--- + + + +:::note Course under construction +This page hasn't been written yet. Come back later, please! +::: + + diff --git a/sources/academy/platform/scraping_with_apify_and_ai/05_publishing.md b/sources/academy/platform/scraping_with_apify_and_ai/05_publishing.md new file mode 100644 index 0000000000..1eacdd0275 --- /dev/null +++ b/sources/academy/platform/scraping_with_apify_and_ai/05_publishing.md @@ -0,0 +1,20 @@ +--- +title: Publishing to Apify Store +description: TBD +slug: /scraping-with-apify-and-ai/publishing-to-apify-store +unlisted: true +--- + + + +:::note Course under construction +This page hasn't been written yet. Come back later, please!
+:::
+
+
diff --git a/sources/academy/platform/scraping_with_apify_and_ai/images/apify-input.webp b/sources/academy/platform/scraping_with_apify_and_ai/images/apify-input.webp
new file mode 100644
index 0000000000..709f21c461
Binary files /dev/null and b/sources/academy/platform/scraping_with_apify_and_ai/images/apify-input.webp differ
diff --git a/sources/academy/platform/scraping_with_apify_and_ai/images/apify-nav-my-actors.webp b/sources/academy/platform/scraping_with_apify_and_ai/images/apify-nav-my-actors.webp
new file mode 100644
index 0000000000..8d41e4aee3
Binary files /dev/null and b/sources/academy/platform/scraping_with_apify_and_ai/images/apify-nav-my-actors.webp differ
diff --git a/sources/academy/platform/scraping_with_apify_and_ai/images/apify-nav-store.webp b/sources/academy/platform/scraping_with_apify_and_ai/images/apify-nav-store.webp
new file mode 100644
index 0000000000..fa562d5857
Binary files /dev/null and b/sources/academy/platform/scraping_with_apify_and_ai/images/apify-nav-store.webp differ
diff --git a/sources/academy/platform/scraping_with_apify_and_ai/images/apify-nav-template.webp b/sources/academy/platform/scraping_with_apify_and_ai/images/apify-nav-template.webp
new file mode 100644
index 0000000000..35fc411d30
Binary files /dev/null and b/sources/academy/platform/scraping_with_apify_and_ai/images/apify-nav-template.webp differ
diff --git a/sources/academy/platform/scraping_with_apify_and_ai/images/apify-nav-templates.webp b/sources/academy/platform/scraping_with_apify_and_ai/images/apify-nav-templates.webp
new file mode 100644
index 0000000000..1ff6562404
Binary files /dev/null and b/sources/academy/platform/scraping_with_apify_and_ai/images/apify-nav-templates.webp differ
diff --git a/sources/academy/platform/scraping_with_apify_and_ai/images/apify-output-modified.webp b/sources/academy/platform/scraping_with_apify_and_ai/images/apify-output-modified.webp
new file mode 100644
index 0000000000..b3f8a76f4c
Binary files /dev/null and b/sources/academy/platform/scraping_with_apify_and_ai/images/apify-output-modified.webp differ
diff --git a/sources/academy/platform/scraping_with_apify_and_ai/images/apify-output-sample.webp b/sources/academy/platform/scraping_with_apify_and_ai/images/apify-output-sample.webp
new file mode 100644
index 0000000000..0efe8e9406
Binary files /dev/null and b/sources/academy/platform/scraping_with_apify_and_ai/images/apify-output-sample.webp differ
diff --git a/sources/academy/platform/scraping_with_apify_and_ai/images/apify-output-warehouse.webp b/sources/academy/platform/scraping_with_apify_and_ai/images/apify-output-warehouse.webp
new file mode 100644
index 0000000000..0f71274b6d
Binary files /dev/null and b/sources/academy/platform/scraping_with_apify_and_ai/images/apify-output-warehouse.webp differ
diff --git a/sources/academy/platform/scraping_with_apify_and_ai/images/apify-web-ide.webp b/sources/academy/platform/scraping_with_apify_and_ai/images/apify-web-ide.webp
new file mode 100644
index 0000000000..644dddc48e
Binary files /dev/null and b/sources/academy/platform/scraping_with_apify_and_ai/images/apify-web-ide.webp differ
diff --git a/sources/academy/platform/scraping_with_apify_and_ai/images/cursor-approve.webp b/sources/academy/platform/scraping_with_apify_and_ai/images/cursor-approve.webp
new file mode 100644
index 0000000000..5a349178bf
Binary files /dev/null and b/sources/academy/platform/scraping_with_apify_and_ai/images/cursor-approve.webp differ
diff --git a/sources/academy/platform/scraping_with_apify_and_ai/images/cursor-install.webp b/sources/academy/platform/scraping_with_apify_and_ai/images/cursor-install.webp
new file mode 100644
index 0000000000..f211caae0e
Binary files /dev/null and b/sources/academy/platform/scraping_with_apify_and_ai/images/cursor-install.webp differ
diff --git a/sources/academy/platform/scraping_with_apify_and_ai/images/cursor-open.webp b/sources/academy/platform/scraping_with_apify_and_ai/images/cursor-open.webp
new file mode 100644
index 0000000000..92cd864701
Binary files /dev/null and b/sources/academy/platform/scraping_with_apify_and_ai/images/cursor-open.webp differ
diff --git a/sources/academy/platform/scraping_with_apify_and_ai/images/cursor-plan.webp b/sources/academy/platform/scraping_with_apify_and_ai/images/cursor-plan.webp
new file mode 100644
index 0000000000..837ce9b6cc
Binary files /dev/null and b/sources/academy/platform/scraping_with_apify_and_ai/images/cursor-plan.webp differ
diff --git a/sources/academy/platform/scraping_with_apify_and_ai/images/cursor-ready.webp b/sources/academy/platform/scraping_with_apify_and_ai/images/cursor-ready.webp
new file mode 100644
index 0000000000..d43cecc608
Binary files /dev/null and b/sources/academy/platform/scraping_with_apify_and_ai/images/cursor-ready.webp differ
diff --git a/sources/academy/platform/scraping_with_apify_and_ai/images/cursor-review.webp b/sources/academy/platform/scraping_with_apify_and_ai/images/cursor-review.webp
new file mode 100644
index 0000000000..601d3e76b5
Binary files /dev/null and b/sources/academy/platform/scraping_with_apify_and_ai/images/cursor-review.webp differ
diff --git a/sources/academy/platform/scraping_with_apify_and_ai/images/nodejs-install.webp b/sources/academy/platform/scraping_with_apify_and_ai/images/nodejs-install.webp
new file mode 100644
index 0000000000..6b607d14d3
Binary files /dev/null and b/sources/academy/platform/scraping_with_apify_and_ai/images/nodejs-install.webp differ
diff --git a/sources/academy/platform/scraping_with_apify_and_ai/index.md b/sources/academy/platform/scraping_with_apify_and_ai/index.md
new file mode 100644
index 0000000000..fdacf0ea56
--- /dev/null
+++ b/sources/academy/platform/scraping_with_apify_and_ai/index.md
@@ -0,0 +1,40 @@
+---
+title: Scraping with Apify and AI
+description: Learn how to use AI to extract information from websites in this practical course, starting from the absolute basics.
+sidebar_position: 5
+sidebar_label: Scraping with Apify and AI
+category: apify platform
+slug: /scraping-with-apify-and-ai
+unlisted: true
+---
+
+import DocCardList from '@theme/DocCardList';
+
+**Learn how to use AI to extract information from websites in this practical course, starting from the absolute basics.**
+
+---
+
+In this course, we'll use AI assistants to create an application for tracking prices. It'll be able to scrape product pages of an e-commerce website and record their prices. Data from several runs of such a program is useful for spotting trends in price changes, detecting discounts, and more.
+
+Unlike carelessly vibe-coded programs, the end product will reach a level of quality that allows for further extension and comfortable maintenance, so that it can be published to [Apify Store](https://apify.com/store).
+
+## What we'll do
+
+- Use ChatGPT (AI chat) to create a program which extracts data from a web page.
+- Save extracted data in various formats, such as CSV, which MS Excel or Google Sheets can open.
+- Use Cursor (AI agent) to improve the program so that it is robust and maintainable.
+- Save time and effort with Apify's scraping platform.
+
+## Who this course is for
+
+Anyone who wants to start with web scraping, has basic experience chatting with an AI assistant, and has an affinity for building digital products can take this course. The course doesn't expect you to have any prior knowledge of web technologies or scraping.
+
+## Requirements
+
+- Prior experience chatting with AI assistants, such as OpenAI ChatGPT, Google Gemini, or Anthropic Claude.
+- A macOS, Linux, or Windows machine. If it's your work computer, make sure you have permissions to install new software.
+- Familiarity with running commands in Terminal (macOS/Linux) or PowerShell (Windows). Just generally knowing what those are and how to use them is sufficient.
+
+## Course content
+