Currently all pages are stored in memory (the content of each resource is kept in Resource.text), which causes high memory consumption.
It would be nice to avoid storing Resource.text and save resources directly to the FS as soon as they are received.
We can probably use streams for that:
- for html, css:
Request -> update links/images/styles/etc. -> saveResource
- all other types:
Request -> saveResource when content modification is not needed
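The two flows above could be sketched with Node's built-in streams roughly as follows. This is only a sketch under assumptions: `createLinkRewriter` and this `saveResource` signature are hypothetical and not the scraper's current API, the rewriter handles a single literal URL, and `Readable.from` stands in for an HTTP response body. The rewriter holds back the tail of each chunk so a URL split across two chunks is still matched.

```javascript
const { Transform, Readable } = require('node:stream');
const { pipeline } = require('node:stream/promises');
const { StringDecoder } = require('node:string_decoder');
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical helper: replaces one literal URL with a local path while streaming.
function createLinkRewriter(from, to) {
  const decoder = new StringDecoder('utf8'); // keeps multi-byte characters intact across chunks
  let carry = ''; // unprocessed tail that may hold the start of a split URL
  return new Transform({
    transform(chunk, _encoding, callback) {
      const text = carry + decoder.write(chunk);
      const lastMatch = text.lastIndexOf(from);
      // Emit everything up to the end of the last full match, or up to the point
      // where a partial match could begin, whichever is further.
      const safeEnd = Math.max(
        lastMatch === -1 ? 0 : lastMatch + from.length,
        text.length - (from.length - 1)
      );
      carry = text.slice(safeEnd);
      callback(null, text.slice(0, safeEnd).split(from).join(to));
    },
    flush(callback) {
      callback(null, (carry + decoder.end()).split(from).join(to));
    }
  });
}

// Hypothetical saveResource: html/css go through the rewriter,
// all other types are piped to disk unchanged.
async function saveResource(body, filePath, { rewrite } = {}) {
  const stages = rewrite ? [body, createLinkRewriter(rewrite.from, rewrite.to)] : [body];
  await pipeline(...stages, fs.createWriteStream(filePath));
}

async function demo() {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'scraper-'));
  const htmlPath = path.join(dir, 'index.html');
  // The URL is deliberately split across two chunks.
  const body = Readable.from([
    Buffer.from('<link href="http://exam'),
    Buffer.from('ple.com/a.css">')
  ]);
  await saveResource(body, htmlPath, {
    rewrite: { from: 'http://example.com/a.css', to: 'css/a.css' }
  });
  return fs.readFileSync(htmlPath, 'utf8');
}

demo().then((html) => console.log(html));
```

With this shape, only one chunk plus a short tail is held in memory per resource, instead of the whole body.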
To do:
- Update the Resource class: get rid of the text property and related functionality. Probably store a reference to the resource's stream instead
- Update the scraper mechanism: rework the request/save functionality in scraper. Replace the requestQueue property with streamsQueue, replace requestedResourcePromises with requestResourceStreams (or remove it), and use streams instead of promises in the request file
- Check and update all actions that use Resource class objects, at least afterResponse and saveResource
- Measure memory consumption of current implementation and streams implementation
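For the memory measurement, one possible approach is to sample `process.memoryUsage().rss` while a scrape runs and compare the peak for both implementations. The helper name `measurePeakRss` and the sampling interval are assumptions for illustration, and the task below is a stand-in allocation rather than a real scrape.

```javascript
// Hypothetical helper: runs an async task and reports peak RSS, sampled on a timer.
async function measurePeakRss(task, intervalMs = 10) {
  const baseline = process.memoryUsage().rss;
  let peak = baseline;
  const timer = setInterval(() => {
    peak = Math.max(peak, process.memoryUsage().rss);
  }, intervalMs);
  try {
    await task();
  } finally {
    clearInterval(timer);
  }
  // Take one last sample in case the task finished between timer ticks.
  peak = Math.max(peak, process.memoryUsage().rss);
  const toMb = (bytes) => bytes / 2 ** 20;
  return {
    baselineMb: toMb(baseline),
    peakMb: toMb(peak),
    growthMb: toMb(peak - baseline)
  };
}

// Stand-in for a scrape: allocate 8 MiB and hold it briefly.
measurePeakRss(async () => {
  const buffer = Buffer.alloc(8 * 1024 * 1024, 1);
  await new Promise((resolve) => setTimeout(resolve, 30));
  return buffer.length;
}).then((result) => console.log(result));
```

RSS is noisy and affected by garbage collection timing, so several runs per implementation on the same site would give a fairer comparison.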
Questions:
- How do we handle links to pages that are not downloaded yet? Can we set the reference in the parent before the child is loaded? (see the getReference action)
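One possible answer, sketched under assumptions: reserve a local filename for a URL the first time it is seen, so the parent can be rewritten and flushed before the child is requested. `FilenameRegistry` and `reserve` are hypothetical names, not part of the scraper, and the sketch ignores the case where the final filename depends on the response (for example, an extension derived from Content-Type).

```javascript
const path = require('node:path');

// Hypothetical registry: assigns a stable local filename per URL before download.
class FilenameRegistry {
  constructor() {
    this.byUrl = new Map(); // url -> reserved filename
    this.used = new Set();  // filenames already taken
  }

  reserve(url) {
    if (this.byUrl.has(url)) return this.byUrl.get(url);
    const base = path.posix.basename(new URL(url).pathname) || 'index.html';
    const ext = path.posix.extname(base);
    const stem = base.slice(0, base.length - ext.length);
    let name = base;
    // Append a numeric suffix until the name is unique.
    for (let n = 1; this.used.has(name); n++) {
      name = `${stem}_${n}${ext}`;
    }
    this.byUrl.set(url, name);
    this.used.add(name);
    return name;
  }
}

const registry = new FilenameRegistry();
// The parent can use this reference immediately; the child is downloaded later.
console.log(registry.reserve('http://example.com/a/page.html'));
```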