Commit Graph

21 Commits

Author SHA1 Message Date
Haiyan Meng
a83433d5cf Optimize memory usage by avoiding accumulating all the referred
documents into a single stack.
2020-06-23 11:36:29 -07:00
Haiyan Meng
2d496e0efe Update golang to 1.14 2020-06-23 11:25:37 -07:00
Haiyan Meng
171412cc98 Use RWMutex to control the map access
Without RWMutex, we may run into fatal error: concurrent map read and map write.
2020-06-23 11:25:37 -07:00
Haiyan Meng
d5c66cb3d4 Add KustomizationDocument.Copy method 2020-02-03 09:59:52 -08:00
Haiyan Meng
154208d331 Improve the efficiency of crawling github by skipping the documents
already in the index
2020-01-24 19:55:56 -08:00
Haiyan Meng
f4636f8555 Add a fileType field into the index 2020-01-17 13:15:49 -08:00
Haiyan Meng
cf8d53a195 Move SeenMap to the utils dir 2020-01-15 15:29:16 -08:00
Haiyan Meng
2e895c147e Use log.Print* instead of fmt.Print* 2020-01-14 15:50:35 -08:00
Haiyan Meng
af131c7471 Use flags to specify crawling mode and github user/repo info 2020-01-14 15:36:12 -08:00
Haiyan Meng
7ac573ae51 Add a flag to specify the index name 2020-01-14 14:25:29 -08:00
Haiyan Meng
72eda992bd make seen a non-primitive type 2020-01-14 12:14:00 -08:00
Haiyan Meng
5f8a8b545b Add "kustomization" into the kustomization filenames used by the crawler 2020-01-06 12:06:18 -08:00
Haiyan Meng
be2e03681d Remove unused param from IndexFunc 2019-12-18 15:56:44 -08:00
Haiyan Meng
127541f610 Support different modes of running the crawler 2019-12-18 15:56:44 -08:00
Haiyan Meng
bef157d6b3 Fix insert/updating document logic 2019-12-18 15:56:44 -08:00
Haiyan Meng
2c2aa928cc Delete non-existing documents from the index 2019-12-18 15:56:44 -08:00
Haiyan Meng
a9244f759e Add support for crawling a specific git user or repo 2019-12-13 11:18:33 -08:00
Haiyan Meng
50ce2a66a3 Separate the two types of crawling
1) crawling the documents in the index to update these documents;
2) crawling the whole github.
2019-12-12 13:42:07 -08:00
Haiyan Meng
bffc0d7071 Multiple improvements to the crawler
1) Set document IDs to avoid duplicating documents;
2) Set the `creationTime` field of each document in the index;
3) Set the `values`, `kinds` and `identifiers` fields for all documents;
4) Add a `Copy` method into the `Document` struct: this fixes the issue
where all the documents existing in the index point to the same Document
object;
5) Avoid using keystore redis;
6) Set imagePullPolicy to `Always` for crawler jobs.
2019-12-11 11:10:48 -08:00
Jeffrey Regan
e9ab3da164 Fix some nits in the crawler and elsewhere. 2019-12-03 10:44:44 -08:00
Haiyan Meng
84b75afae4 Make the crawler work
1) add the crawler binary and fix the crawler library
2) remove the readiness probe in the search backend
3) add config for redis keystore
4) add github_api_secret.txt file with instructions
2019-11-26 09:50:51 -08:00
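
Note on 171412cc98 (Use RWMutex to control the map access): Go's runtime aborts with "fatal error: concurrent map read and map write" when one goroutine reads a plain map while another writes to it. The sketch below shows the general pattern of guarding a map with sync.RWMutex. It is not the code from that commit: SeenSet, NewSeenSet, Add, Has and Len are hypothetical names chosen for illustration, and the crawler's actual SeenMap type may be shaped differently.

```go
package main

import (
	"fmt"
	"sync"
)

// SeenSet is a hypothetical stand-in for a crawler's "seen" map.
// The real type in the repository may look different.
type SeenSet struct {
	mu   sync.RWMutex
	seen map[string]struct{}
}

// NewSeenSet returns an empty, ready-to-use set.
func NewSeenSet() *SeenSet {
	return &SeenSet{seen: make(map[string]struct{})}
}

// Add records id. Writers take the exclusive lock.
func (s *SeenSet) Add(id string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.seen[id] = struct{}{}
}

// Has reports whether id was recorded. Readers take the shared lock,
// so many goroutines can call Has at the same time.
func (s *SeenSet) Has(id string) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	_, ok := s.seen[id]
	return ok
}

// Len returns the number of recorded ids.
func (s *SeenSet) Len() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.seen)
}

func main() {
	s := NewSeenSet()
	var wg sync.WaitGroup
	// Several goroutines write and read concurrently; without the
	// lock this access pattern can crash the program.
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			id := fmt.Sprintf("doc-%d", n)
			s.Add(id)
			_ = s.Has(id)
		}(i)
	}
	wg.Wait()
	fmt.Println("documents seen:", s.Len())
}
```

An RWMutex suits a workload where lookups far outnumber insertions, since readers do not block each other. Running the program with `go run -race` is one way to check that no unguarded access remains.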