Haiyan Meng
a83433d5cf
Optimize memory usage by avoiding accumulating all the referred documents into a single stack.
2020-06-23 11:36:29 -07:00
Haiyan Meng
2d496e0efe
Update golang to 1.14
2020-06-23 11:25:37 -07:00
Haiyan Meng
171412cc98
Use RWMutex to control the map access
Without RWMutex, we may run into `fatal error: concurrent map read and map write`.
2020-06-23 11:25:37 -07:00
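The RWMutex change above can be illustrated with a minimal sketch. The type `guardedSet` and its methods are hypothetical names chosen for this example and are not taken from the crawler's code; the sketch only assumes that a plain Go map is shared between goroutines, which is the situation the commit message describes.

```go
package main

import (
	"fmt"
	"sync"
)

// guardedSet is a hypothetical type used only for illustration.
// It pairs a map with a sync.RWMutex so that concurrent access is safe.
type guardedSet struct {
	mu   sync.RWMutex
	seen map[string]struct{}
}

func newGuardedSet() *guardedSet {
	return &guardedSet{seen: make(map[string]struct{})}
}

// Add takes the write lock, excluding all other readers and writers.
func (s *guardedSet) Add(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.seen[key] = struct{}{}
}

// Has takes the read lock, so multiple readers can proceed at once.
func (s *guardedSet) Has(key string) bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	_, ok := s.seen[key]
	return ok
}

// Len reports the number of stored keys under the read lock.
func (s *guardedSet) Len() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.seen)
}

func main() {
	s := newGuardedSet()
	var wg sync.WaitGroup
	// Eight goroutines write and read concurrently; without the locks
	// the Go runtime may abort with a concurrent map access fatal error.
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for j := 0; j < 100; j++ {
				key := fmt.Sprintf("doc-%d-%d", id, j)
				s.Add(key)
				_ = s.Has(key)
			}
		}(i)
	}
	wg.Wait()
	fmt.Println(s.Len()) // prints 800: every key is unique
}
```

A read-write lock fits a workload dominated by lookups, since readers do not block one another; if writes were as frequent as reads, a plain `sync.Mutex` would likely serve equally well.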
Haiyan Meng
d5c66cb3d4
Add KustomizationDocument.Copy method
2020-02-03 09:59:52 -08:00
Haiyan Meng
154208d331
Improve the efficiency of crawling github by skipping the documents already in the index
2020-01-24 19:55:56 -08:00
Haiyan Meng
f4636f8555
Add a fileType field into the index
2020-01-17 13:15:49 -08:00
Haiyan Meng
cf8d53a195
Move SeenMap to the utils dir
2020-01-15 15:29:16 -08:00
Haiyan Meng
2e895c147e
Use log.Print* instead of fmt.Print*
2020-01-14 15:50:35 -08:00
Haiyan Meng
af131c7471
Use flags to specify crawling mode and github user/repo info
2020-01-14 15:36:12 -08:00
Haiyan Meng
7ac573ae51
Add a flag to specify the index name
2020-01-14 14:25:29 -08:00
Haiyan Meng
72eda992bd
Make seen a non-primitive type
2020-01-14 12:14:00 -08:00
Haiyan Meng
5f8a8b545b
Add "kustomization" into the kustomization filenames used by the crawler
2020-01-06 12:06:18 -08:00
Haiyan Meng
be2e03681d
Remove unused param from IndexFunc
2019-12-18 15:56:44 -08:00
Haiyan Meng
127541f610
Support different modes of running the crawler
2019-12-18 15:56:44 -08:00
Haiyan Meng
bef157d6b3
Fix document insert/update logic
2019-12-18 15:56:44 -08:00
Haiyan Meng
2c2aa928cc
Delete non-existing documents from the index
2019-12-18 15:56:44 -08:00
Haiyan Meng
a9244f759e
Add support for crawling a specific git user or repo
2019-12-13 11:18:33 -08:00
Haiyan Meng
50ce2a66a3
Separate the two types of crawling
1) crawling the documents in the index to update these documents;
2) crawling the whole github.
2019-12-12 13:42:07 -08:00
Haiyan Meng
bffc0d7071
Multiple improvements to the crawler
1) Set document IDs to avoid duplicating documents;
2) Set the `creationTime` field of each document in the index;
3) Set the `values`, `kinds` and `identifiers` fields for all documents;
4) Add a `Copy` method to the `Document` struct: this fixes the issue where all the documents existing in the index point to the same `Document` object;
5) Avoid using the redis keystore;
6) Set imagePullPolicy to `Always` for crawler jobs.
2019-12-11 11:10:48 -08:00
Jeffrey Regan
e9ab3da164
Fix some nits in the crawler and elsewhere.
2019-12-03 10:44:44 -08:00
Haiyan Meng
84b75afae4
Make the crawler work
1) add the crawler binary and fix the crawler library
2) remove the readiness probe in the search backend
3) add config for redis keystore
4) add github_api_secret.txt file with instructions
2019-11-26 09:50:51 -08:00