Practical Issues for Automated Categorization of Web Sites
John M. Pierre
Metacode Technologies, Inc.
139 Townsend Street, Suite 100
San Francisco, CA 94107
jpierre@metacode.com
September 2000
Abstract:
In this paper we discuss several issues related to automated text classification
of web sites. We analyze the nature of web content and metadata and requirements
for text features. We present an approach for targeted spidering including
metadata extraction and opportunistic crawling of specific semantic hyperlinks.
We describe a system for automatically classifying web sites into industry
categories and present performance results based on different combinations
of text features and training data.
1. Introduction
There are an estimated 1 billion pages accessible on the web with 1.5 million
pages being added daily. Describing and organizing this vast amount of
content is essential for realizing its full potential as an information
resource. Accomplishing this in a meaningful way will require consistent
use of metadata and other descriptive data structures such as semantic
linking[1]. Categorization is
an important ingredient as is evident from the popularity of web directories
such as Yahoo![2], Looksmart[3],
and the Open Directory Project[4].
However these resources have been created by large teams of human editors
and represent only one kind of classification task that, while widely useful,
can never be suitable to all applications. Automated classification is
needed for at least two important reasons. The first is the sheer scale
of resources available the web and their ever-changing nature. It is simply
not feasible to keep up with the pace of growth and change on the web through
manual classification without expending immense time and effort. The second
reason is that classification itself is a subjective activity. Different
classification tasks are needed for different applications. No single classification
scheme is suitable for all applications. In this paper we discuss some
practical issues for applying methods of automated text categorization
to web content. Rather than a take a one size fits all approach we advocate
the use of targeted specific classification tasks, relevant to solving
specific problems. In section 2 we
discuss the nature of web content and its implications for extracting good
text features. We describe a specialized system for classifying web sites
into industry categories in section 3,
and present the results in section 4.
In section 5 we discuss related
work. We state our conclusions and make suggestions for further study in
section 6.
2. Web Sites
One the main challenges with classifying web pages is the wide variation
in their content and quality. Most text categorization methods rely on
the existence of good quality texts, especially for training[5].
Unlike many of the well known collections typically studied in automated
text classification experiments (i.e. TREC, Reuters-22578, OSHUMED), in
comparison the web lacks homogeneity and regularness. To make matters worse,
much of the existing web page content is based in images, plug-in applications,
or other non-text media. The usage of metadata is inconsistent or non-existent.
In this section we survey the landscape of web content, and its relation
to the requirements of text categorization systems.
2.1 Analysis of Web Content
In an attempt to characterize the nature of the content to be classified,
we performed a rudimentary quantitative analysis. Our results were obtained
by analyzing a collection of 29,998 web domains obtained from a random
dump of the database of a well-known domain name registration company.
Of course these results reflect the biases of our small samples and don't
necessarily generalize to the web as a whole, however they should be reflective
of the issues at hand. Since our classification method is text based, it
is important to know the amount and quality of the text based features
that typically appear in web sites. In Table 1
we show the percentage of web sites with a certain number of words for
each type of metatag. We analyzed a sample of 19195 domains with live web
sites and counted the number of words used in the content attribute of
the
<META name=``keywords''> and <META name=``description''>
tags as well as <TITLE> tags. We also counted free text found
within the <BODY> tag, excluding all other HTML tags.
Table 1: Percentage of Web Pages with Words in HTML Tags
Tag Type |
0 words |
1-10 words |
11-50 words |
51+ words |
Title |
4% |
89% |
6% |
1% |
Meta-Description |
68% |
8% |
21% |
3% |
Meta-Keywords |
66% |
5% |
19% |
10% |
Body Text |
17% |
5% |
21% |
57% |
The most obvious source of text is within the body of the web page.
We noticed that about 17% of top level web pages had no usable body text.
These cases include pages that only contain frame sets, images, or plug-ins
(our user agent followed redirects whenever possible). About a quarter
of web pages contained 1-50 words, and the majority of web pages contained
over 50 words. Other sources of text are the content in HTML tags including
titles, metatags, and hyperlinks. One of the more promising sources of
text features should be found in web page metadata. Though title tags are
common the amount of text is relatively small with 89% of the titles containing
only 1-10 words. Also, the titles often contain only names or terms such
as ``home page'', which are not particularly helpful for subject classification.
Metatags for keywords and descriptions are used by several major search
engines, where they play an important role in the ranking and display of
search results. Despite this, only about a third of web sites were found
to contain these tags. As it turns out, metatags can be useful when they
exist because they contain text specifically intended to aid in the identification
of a web site's subject areas1.
Most of the time these metatags contained between 11 and 50 words, with
a smaller percentage containing more than 50 words (in contrast to the
number of words in the body text which tended to contain more than 50 words).
2.2 Good Text Features
In reference[5] it is argued that
for the purposes of automated text classification text features should
be:
-
Relatively few in number
-
Moderate in frequency of assignment
-
Low in redundancy
-
Low in noise
-
Related in semantic scope to the classes to be assigned
-
Relatively unambiguous in meaning
Due to the wide variety of purpose and scope of current web content, items
4 and 5 are difficult requirements to meet for most classification tasks.
For subject classification, metatags seem to meet those requirements better
than other sources of text such as titles and body text. However the lack
of widespread usage of metatags is a problem if coverage of the majority
of web content is desired. In the long term, automated categorization could
really benefit if greater attention is paid to the creation and usage of
rich metadata, especially if the above requirements are taken into consideration.
In the short term, one must implement a strategy for obtaining good text
features from the existing HTML and natural language cues that takes the
above requirements as well as the goals of the classification task into
consideration.
3. Experimental Setup
The goal of our project was to rapidly classify domain names (web sites)
into broad industry categories. In this section we describe the main ingredients
of our classification experiments including the data, architecture, and
evaluation measures.
3.1 Classification Scheme
The categorization scheme used was the top level of the 1997 North American
Industrial Classification Scheme (NAICS) [6],
which consists of 21 broad industry categories shown in Table 2.
Table 2: Top level NAICS Categories
NAICS code |
NAICS Description |
11 |
Agriculture, Forestry, Fishing, and Hunting |
21 |
Mining |
22 |
Utilities |
23 |
Construction |
31-33 |
Manufacturing |
42 |
Wholesale Trade |
44-45 |
Retail Trade |
48-49 |
Transportation and Warehousing |
51 |
Information |
52 |
Finance and Insurance |
53 |
Real Estate and Rental and Leasing |
54 |
Professional, Scientific and Technical Services |
55 |
Management of Companies and Enterprises |
56 |
Administrative and Support, Waste Management and Remediation
Services |
61 |
Educational Services |
62 |
Health Care and Social Assistance |
71 |
Arts, Entertainment and Recreation |
72 |
Accommodation and Food Services |
81 |
Other Services (except Public Administration) |
92 |
Public Administration |
99 |
Unclassified Establishments |
Some of our resources had been previously classified using the older
1987 Standard Industrial Classification (SIC) system. In these cases we
used the published mappings[6] to
convert all assigned SIC categories to their NAICS equivalents. All lower
level NAICS subcategories were generalized up to the appropriate top level
category.
3.2 Targeted Spidering
Based on the results of section 2,
it is obvious that selection of adequate text features is an important
issue and certainly not to be taken for granted. To balance the needs of
our text-based classifier against the speed and storage limitations of
a large-scale crawling effort, we took an approach for spidering web sites
and gathering text that was targeted to the classification task at hand.
Our opportunistic spider begins at the top level web page and attempts
to extract useful text from metatags and titles if they exist, and then
follows links for frame sets if they exist. It also follows any links that
contain key substrings such as prod, services,
about,
info, press, and news, and again looks for metatag
content. These substrings were chosen based on an ad hoc frequency
analysis and the assumption that they tend to point to content that is
useful for deducing an industry classification. Only if no metatag content
is found does the spider gather actual body text of the web page. For efficiency
we ran several spiders in parallel, each working on different lists of
individual domain names. What we were attempting here was to take advantage
of the current web's implicit semantic structure. One the advantages
of moving towards an explicit semantic structure for hypertext documents[1]
is that an opportunistic spidering approach could really benefit from a
formalized description of the semantic relationships between linked web
pages. In some preliminary tests we found the best classifier accuracy
was obtained by using only the contents of the keywords and description
metatags as the source of text features. Adding body text decreased classification
accuracy. However, due to the lack of widespread usage of metatags limiting
ourselves to these features was not practical, and other sources of text
such as titles and body text were needed to provide adequate coverage of
web sites. Our targeted spidering approach attempts to gather the higher
quality text features from metatags and only resorts to lower quality texts
if needed.
3.3 Test Data
>From our initial list of 29,998 domain names we used our targeted spider
to determine which sites were live and obtained extracted text using the
approach outlined in section 3.2.
Of those, 13,557 domain names had usable text content and were pre-classified
according to industry category2.
3.4 Training Data
We took two approaches to constructing training sets for our classifiers.
In the first approach we used a combination of 426 NAICS category labels
(including subcategories) and 1504 U.S. Securities and Exchange Commission
(SEC) 10-K filings3
for public companies[7] as training
examples. In the second approach we used a set of 3618 pre-classified domain
names along with text for each domain obtained using our spider. The first
approach can be considered as using ``prior knowledge'' obtained in a different
domain. It is interesting to see how knowledge from a different domain
generalizes to our problem. Furthermore it is often the case that training
examples can be difficult to obtain (thus the need for an automated solution
in the first place). The second approach is the more conventional classification
by example. In our case it was made possible by the fact that our database
of domain names was pre-classified according one or more industry categories.
3.5 Classifier Architecture
Our text classifier consisted of three modules: the targeted spider for
extracting text features associated with a web site, an information retrieval
engine for comparing queries to training examples, and a decision algorithm
for assigning categories. Our spider was designed to quickly process a
large database of top level web domain names (e.g. domain.com, domain.net,
etc.). As described in section 3.2
we implemented an opportunistic spider targeted to finding high quality
text from pages that described the business area, products, or services
of a commercial web site. After accumulating text features, a query was
submitted to the text classifier. The domain name and any automatically
assigned categories were logged in a central database. Several spiders
could be run in parallel for efficient use of system resources. Our information
retrieval engine was based on Latent Semantic Indexing (LSI)[8].
LSI is a variation of the vector space model of information retrieval that
uses the technique of singular value decomposition (SVD) to reduce the
dimensionality of the vector space. In a previous work[7]
it was shown that LSI provided better accuracy with fewer training set
documents per category than standard TF-IDF weighting. Queries were compared
to training set documents based on their cosine similarity, and a ranked
list of matching documents and scores was forwarded to the decision module.
In the decision module, we used a k-nearest neighbor algorithm for ranking
categories and assigned the top ranking category to the web site. This
type of classifier tends to perform well compared to other methods[11],
is robust, and tolerant of noisy data (all are important qualities when
dealing with web content).
3.6 Evaluation Measures
System evaluation was carried out using the standard precision, recall,
and F1 measures[9,10].
The F1 measure combines precision and recall with equal importance into
a single parameter for optimization and is defined as
where P is precision and R is recall. We computed global estimates of performance
using both micro-averaging (results are computed based on global sums over
all decisions) and macro-averaging (results are computed on a per-category
basis, then averaged over categories). Micro-averaged scores tend to be
dominated by the most commonly used categories, while macro-averaged scores
tend to be dominated by the performance in rarely used categories. This
distinction was relevant to our problem, because it turned out that the
vast majority of commercial web sites are associated with the Manufacturing
(31-33) category.
4. Results
In our first experiment we varied the sources of text features for 1125
pre-classified web domains. We constructed separate test sets based on
text extracted from the body text, metatags (keywords and descriptions),
and a combination of both. The training set consisted of SEC documents
and NAICS category descriptions. Results are shown in Table 3.
Table 3: Performance vs. Text Features
Sources of Text |
micro P |
micro R |
micro F1 |
Body |
0.47 |
0.34 |
0.39 |
Body + Metatags |
0.55 |
0.34 |
0.42 |
Metatags |
0.64 |
0.39 |
0.48 |
Using metatags as the only source of text features resulted in the
most accurate classifications. Precision decreases noticeably when only
the body text is used. It is interesting that including the body text along
with the metatags also results in less accurate classifications. The usefulness
of metadata as a source of high quality text features should not be surprising
since it meets most of the criteria listed in 2.2.
In our second experiment we compared classifiers constructed from the two
different training sets described in section 3.4.
The results are shown in Table 4.
Table 4: Performance vs. Training Set
Classifier |
micro P |
micro R |
micro F1 |
macro P |
macro R |
macro F1 |
SEC-NAICS |
0.66 |
0.35 |
0.45 |
0.23 |
0.18 |
0.09 |
Web Pages |
0.71 |
0.75 |
0.73 |
0.70 |
0.37 |
0.40 |
The SEC-NAICS training set achieved respectable micro-averaged scores,
but the macro-averaged scores were low. One reason for this is that this
classifier generalizes well in categories that are common to the business
and web domains (31-33, 23, 51), but has trouble with recall in categories
that are not well represented in the business domain (71, 92) and poor
precision in categories that are not as common in the web domain (54, 52,
56). The training set constructed from web site text performed better overall.
Macro-averaged recall was much lower than micro-averaged recall. This can
be partially explained by the following example. The categories Wholesale
Trade (42) and Retail Trade (44-45) have a subtle difference especially
when it comes to web page text which tends to focus on products and services
delivered rather than the Retail vs. Wholesale distinction. In our training
set, category 42 was much more common than 44-45, and the former tended
to be assigned in place of the latter, resulting in low recall for 44-45.
Other rare categories also tended to have low recall (e.g. 23, 56, 81).
5. Related Work
Some automatically constructed, large-scale web directories have been deployed
as commercial services such as Northern Light[12],
Inktomi Directory Engine[13], Thunderstone
Web Site Catalog[14]. Details
about these systems are generally unavailable because of their proprietary
nature. It is interesting that these directories tend not to be as popular
as their manually constructed counterparts. A system for automated discovery
and classification of domain specific web resources is described as part
of the DESIRE II project[15][16].
Their classification algorithm weights terms from metatags higher than
titles and headings, which are weighted higher than plain body text. They
also describe the use of classification software as a topic filter for
harvesting a subject specific web index. Another system, Pharos (part of
the Alexandria Digital Library Project), is a scalable architecture for
searching heterogeneous information sources that leverages the use of metadata[17]
and automated classification[18].
The hyperlink structure of the web can be exploited for automated classification
by using the anchor text and other context from linking documents as a
source of text features[19]. Approaches
to efficient web spidering[20][21]
have been investigated and are especially important for very large-scale
crawling efforts. A complete system for automatically building searchable
databases of domain specific web resources using a combination of techniques
such as automated classification, targeted spidering, and information extraction
is described in reference[22].
6. Conclusions
Automated methods of knowledge discovery, including classification, will
be important for establishing the semantic web. Classification is not objective.
A single classification can never be adequate for all the possible applications.
A specialized approach including pragmatic, targeted techniques can be
applied to specific classification tasks. In this paper we described a
practical system for classifying domain names into industry categories
that gives good results. From the results in Table 3
we concluded that metatags were the best source of quality text features,
at least compared to the body text. However by limiting ourselves to metatags
we would not be able to classify the large majority web sites. Therefore
we opted for a targeted spider that extracted metatag text first, looked
for pages that described business activities, and then degraded to other
text only if necessary. It seems clear that text contained in structured
metadata fields results in better automated categorization. If the web
moves toward a more formal semantic structure as outlined by Tim Berners-Lee[1],
then automated methods can benefit. If more and different kinds of automated
classification tasks can be accomplished more accurately, the web can be
made to be more useful as well as more usable. We outline a basic approach
for building a targeted automated web categorization solution:
-
Knowledge Gathering - It is important to have a clear understanding
of the domain to be classified and the quality of the content involved.
The web is a heterogeneous environment, but within given domains patterns
and commonalties can emerge. Taking advantage of specialized knowledge
can improve classification results.
-
Targeted Spidering - For each classification task different features
will be important. However, due to the lack of homogeneity in web content,
the existence of key features can be quite inconsistent. A targeted spidering
approach tries to gather as many key features as possible with as little
effort as possible. In the future this type of approach can benefit greatly
from a web structure that encourages the use of metadata and semantically-typed
links.
-
Training - The best training data comes from the domain to be classified,
since that gives the best chance for identifying the key features. In cases
where it's not feasible to assemble enough training data in the target
domain, it may be possible to achieve acceptable results using training
data gathered from a different domain. This can be true for web content
which can be unstructured, uncontrolled, immense, and hence difficult to
assemble quality training data. However, controlled collections of pre-classfied
electronic documents can be obtained in many important domains (financial,
legal, medical, etc.) and applied to automated categorization of web content.
-
Classification - In addition to being as accurate as possible, the
classification method needs to be efficient, scalable, robust, and tolerant
of noisy data. Classification algorithms that utilize the link structure
of the web, including formalized semantic linking structures should be
further investigated.
Better acceptance of metadata is one key to the future of the semantic
web. However, creation of quality metadata is tedious and is itself a prime
candidate for automated methods. A preliminary method such as the one outlined
in the paper can serve as the basis for bootstrapping[23]
a more sophisticated classifier that takes full advantage of the semantic
web, and so on.
Acknowledgments
I would like to thank Roger Avedon, Mark Butler, and Ron Daniel for collaboration
on the design of the system, and Bill Wohler for collaboration on system
design and software implementation. Special thanks to Network Solutions
for providing classified domain names.
Bibliography
-
1
-
T. Berners-Lee. Semantic Web Road Map.
http://www.w3.org/DesignIssues/Semantic.html, 1998.
-
2
-
Yahoo!, http://www.yahoo.com/
-
3
-
Looksmart, http://www.looksmart.com/
-
4
-
Open Directory Project, http://www.dmoz.org/
-
5
-
D. Lewis. Text Representation for Intelligent Text Retrieval: A Classification-Oriented
View. In P. Jacobs, editor,
Text-Based Intelligent Systems, Chapter
9. Lawrence Erlbaum, 1992.
-
6
-
North American Industrial Classification System (NAICS) - United States,
1997.
http://www.census.gov/epcd/www/naics.html
-
7
-
R. Dolin, J. Pierre, M. Butler, and R. Avedon. Practical Evaluation of
IR within Automated Classification Systems. Eighth International Conference
of Information and Knowledge Management, 1999.
-
8
-
S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing
by latent semantic analysis. Journal of the American Society for Information
Science, 41 (6):391-407, 1990.
-
9
-
C.J. van Rijsbergen. Information Retrieval. Butterworths, London,
1979.
-
10
-
D. Lewis. Evaluating Text Categorization. In Proceedings of the Speech
and Natural Language Workshop, 312-318, Morgan Kaufmann 1991.
-
11
-
Y. Yang and X. Liu. A re-examination of text categorization methods. In
Proceedings of the 22nd Annual ACM SIGIR Conference on Research and
Development in Information Retrieval, 42-49, 1999.
-
12
-
Northern Light, http://www.northernlight.com/
-
13
-
Inktomi Directory Engine,
http://www.inktomi.com/products/portal/directory/
-
14
-
Thunderstone Web Site Catalog,
http://search.thunderstone.com/texis/websearch/about.html
-
15
-
A. Ardo, T. Koch, and L. Nooden. The construction of a robot-generated
subject index. EU Project DESIRE II D3.6a, Working Paper 1 1999.
http://www.lub.lu.se/desire/DESIRE36a-WP1.html
-
16
-
T. Kock and A. Ardo. Automatic classification of full-text HTML-documents
from one specific subject area. EU Project DESIRE II D3.6a, Working
Paper 2 2000.
http://www.lub.lu.se/desire/DESIRE36a-WP2.html
-
17
-
R. Dolin, D. Agrawal, L. Dillon, and A. El Abbadi. Pharos: A Scalable Distributed
Architecture for Locating Heterogeneous Information Sources Version. In
In Proceedings of the 6th International Conference on Information and
Knowledge Management, 1997.
-
18
-
R. Dolin, D. Agrawal, A. El Abbadi, and J. Pearlman. Using Automated Classification
for Summarizing and Selecting Heterogeneous Information Sources. In D-Lib
Magazine, January, 1998.
-
19
-
G. Attardi, A. Gulli, and F. Sebastiani. Automatic Web Page Categorization
by Link and Context Analysis. In Chris Hutchison and Gaetano Lanzarone
(eds.),
Proceedings of THAI'99, European Symposium on Telematics, Hypermedia
and Artificial Intelligence, 105-119, 1999.
-
20
-
J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering.
In Computer Networks and ISDN Systems (WWW7), Vol. 30, 1998.
-
21
-
J. Rennie and A. McCallum. Using Reinforcement Learning to Spider the Web
Efficiently. Proceedings of the Sixteenth International Conference on
Machine Learning, 1999.
-
22
-
A. McCallum, K. Nigam, J. Rennie, and K. Seymore. A Machine Learning Approach
to Building Domain-Specific Search Engines.
The Sixteenth International
Joint Conference on Artificial Intelligence, 1999.
-
23
-
R. Jones, A. McCallum, K. Nigam, and E. Riloff. Bootstrapping for Text
Learning Tasks. In IJCAI-99 Workshop on Text Mining: Foundations, Techniques
and Applications, 52-63, 1999.
Footnotes
-
... areas1
-
The possibilities for misuse/abuse of these tags to improve search engine
rankings are well known; however, we found these practices to be not very
widespread in our sample and of little consequence.
-
... category2
-
Industry classifications were provided by InfoUSA and Dunn & Bradstreet.
-
... filings3
-
SEC 10-K filings are annual reports required of all U.S. public companies
that describe business activities for the year. Each public company is
also assigned an SIC category.
2000-08-30